Abstract. In two-player finite-state stochastic games of partial observation
on graphs, in every state of the graph, the players simultaneously
choose an action, and their joint actions determine a probability distribution
over the successor states. The game is played for infinitely many
rounds and thus the players construct an infinite path in the graph. We
consider reachability objectives where the first player tries to ensure a
target state to be visited almost-surely (i.e., with probability 1) or positively
(i.e., with positive probability), no matter the strategy of the
second player.

We classify such games according to the information and to the power of
randomization available to the players. On the basis of information, the
game can be one-sided with either (a) player 1, or (b) player 2 having
partial observation (and the other player has perfect observation), or two-sided
with (c) both players having partial observation. On the basis of
randomization, (a) the players may not be allowed to use randomization 
(pure strategies), or (b) they may choose a probability distribution over 
actions but the actual random choice is external and not visible to the 
player (actions invisible), or (c) they may use full randomization. 

Our main results for pure strategies are as follows: (1) For one-sided 
games with player 2 perfect observation we show that (in contrast to full 
randomized strategies) belief-based (subset-construction based) strategies
are not sufficient, and we present an exponential upper bound on
memory both for almost-sure and positive winning strategies; we show
that the problem of deciding the existence of almost-sure and positive
winning strategies for player 1 is EXPTIME-complete and present sym- 
bolic algorithms that avoid the explicit exponential construction. (2) For 
one-sided games with player 1 perfect observation we show that non- 
elementary memory is both necessary and sufficient for both almost-sure 
and positive winning strategies. (3) We show that for the general (two- 
sided) case finite-memory strategies are sufficient for both positive and 
almost-sure winning, and at least non-elementary memory is required. 
We establish the equivalence of the almost-sure winning problems for 
pure strategies and for randomized strategies with actions invisible. Our 
equivalence result exhibits serious flaws in previous results in the literature:
we show a non-elementary memory lower bound for almost-sure
winning whereas an exponential upper bound was previously claimed. 
1 Introduction

Games on graphs. Two-player games on graphs play a central role in several 
important problems in computer science, such as controller synthesis [33,35], 
verification of open systems [2], realizability and compatibility checking [1,21, 
18], and many others. Most results about two-player games on graphs make 
the hypothesis of perfect observation (i.e., both players have perfect or complete 
observation about the state of the game). This assumption is often not realistic
in practice. For example in the context of hybrid systems, the controller acquires 
information about the state of a plant using digital sensors with finite precision, 
which gives imperfect information about the state of the plant [20, 27]. Similarly, 
in a concurrent system where the players represent individual processes, each 
process has only access to the public variables of the other processes, not to their 
private variables [37, 2]. Such problems are better modeled in the more general
framework of partial-observation games [36-38, 16, 7] and have been studied in
the context of verification and synthesis [30, 22] (also see [3] for pushdown
partial-observation games).

Partial-observation stochastic games and subclasses. In two-player 
partial-observation stochastic games on graphs with a finite state space, in ev- 
ery round, both players independently and simultaneously choose actions which 
along with the current state give a probability distribution over the successor 
states in the game. In a general setting, the players may not be able to dis- 
tinguish certain states which are observationally equivalent for them (e.g., if 
they differ only by the value of private variables). The state space is partitioned 
into observations defined as equivalence classes and the players do not see the 
actual state of the game, but only an observation (which is typically different 
for the two players). The model of partial-observation games we consider is the 
same as the model of stochastic games with signals [7] and is a standard model in 
game theory [39, 41]. It subsumes other classical game models such as concurrent 
games [40, 19], probabilistic automata [34, 9, 32], and partial-observation Markov 
decision processes (POMDPs) [31] (see also the recent decidability and complex- 
ity results for probabilistic automata [4-6,10-12,25] and for POMDPs [15,4, 
43]). 

The special case of perfect observation for a player corresponds to every ob- 
servation for this player being a singleton. Depending on which player has per- 
fect observation, we consider the following one-sided subclasses of the general 
two-sided partial-observation stochastic games: (1) player 1 partial and player 2 
perfect where player 2 has perfect observation, and player 1 has partial obser- 
vation; and (2) player 1 perfect and player 2 partial where player 1 has perfect 
observation, and player 2 has partial observation. The case where the two play- 
ers have perfect observation corresponds to the well-known perfect-information 
(perfect-observation) stochastic games [40, 17, 19]. 

Note that in a given game G, if player 1 wins in the setting of player 1 partial 
and player 2 perfect, then player 1 wins in the game G as well. Analogously, if 
player 1 cannot win in the setting of player 1 perfect and player 2 partial, then
player 1 does not win in the game G either. In this sense, the one-sided games are
conservative over- and under-approximations of two-sided games. In the context 
of applications in verification and synthesis, the conservative approximation is 
that the adversary is all powerful, and hence player 1 partial and player 2 perfect 
games provide the important worst-case analysis of partial-observation games. 

Objectives and qualitative problems. In this work we consider partial- 
observation stochastic games with reachability objectives where the goal of 
player 1 is to reach a set of target states and the goal of player 2 is to prevent 
player 1 from reaching the target states. The study of partial-observation games 
is considerably more complicated than games of perfect observation. For exam- 
ple, in contrast to perfect-observation games, strategies in partial-observation 
games require both randomization and memory for reachability objectives; and 
the quantitative problem of deciding whether there exists a strategy for player 1 
to ensure that the target is reached with probability at least 1/2 can be decided
in NP ∩ coNP for perfect-observation stochastic games [17], whereas the problem
is undecidable even for partial-observation stochastic games with only one
player [32]. Since the quantitative problem is undecidable we consider the following
qualitative problems: the almost-sure (resp. positive) problem asks whether
there exists a strategy for player 1 to ensure that the target set is reached with 
probability 1 (resp. positive probability). 

Classes of strategies. In general, randomized strategies are necessary to win 
with probability 1 in a partial-observation game with reachability objective [16]. 
However, there exist two types of randomized strategies where either (i) actions 
are visible, the player can observe the action he played [16, 7], or (ii) actions are 
invisible, the player may choose a probability distribution over actions, but the 
source of randomization is external and the actual choice of the action is invisible 
to the player [26]. The second model is more general since the qualitative problems
of randomized strategies with actions visible can be reduced in polynomial
time to randomized strategies with actions invisible, by modeling the visibility 
of actions using the observations on states. 

With actions visible, the almost-sure (resp. positive) problem was shown to be 
EXPTIME-complete (resp. PTIME-complete) for one-sided games with player 1 
partial and player 2 perfect [16], and 2EXPTIME-complete (resp. EXPTIME- 
complete) in the two-sided case [7]. For the positive problem memoryless ran- 
domized strategies exist, and for the almost-sure problem belief-based strategies 
exist (strategies based on subset construction that consider the possible current 
states of the game). 

It was remarked (without any proof) in [16, p. 4] that these results easily 
extend to randomized strategies with actions invisible for one-sided games with 
player 1 partial and player 2 perfect. It was claimed in [26] (Theorems 1 & 2) 
that the almost-sure problem is 2EXPTIME-complete for randomized strategies 
with actions invisible for two-sided games, and that belief-based strategies are 
sufficient for player 1. Thus it is believed that the two qualitative problems with 
actions visible or actions invisible are essentially equivalent. 



In this paper, we consider the class of pure strategies, which do not use 
randomization at all. Pure strategies arise naturally in the implementation of 
controllers and processes that do not have access to any source of randomization. 
Moreover we will establish deep connections between the qualitative problems 
for pure strategies and for randomized strategies with actions invisible, which
on one hand exhibit major flaws in previous results of the literature (the remark
without proof of [16] and the main results of [26]), and on the other hand show
that the solution for almost-sure winning randomized strategies with actions
invisible (which is the most general case) can be surprisingly obtained by solving 
the problem for pure strategies. 

Contributions. The contributions of the paper are summarized below. 

1. Player 1 partial and player 2 perfect. We show that both for almost-sure and 
positive winning, belief-based pure strategies are not sufficient. This implies 
that the classical approaches relying on the belief-based subset construction 
cannot work for solving the qualitative problems for pure strategies. How- 
ever, we present an optimal exponential upper bound on the memory needed 
by pure strategies (the exponential lower bound follows from the special case 
of non-stochastic games [8]). By a reduction to a perfect-observation game 
of exponential size, we show that both the almost-sure and positive problems
are EXPTIME-complete for one-sided games with perfect-observation
for player 2. In contrast to the previous proofs of EXPTIME upper bound 
that rely either on subset constructions or enumeration of belief-based strate- 
gies, our correctness proof relies on a novel rank-based argument that works 
uniformly both for positive and almost-sure winning. The structure of this 
construction also provides symbolic antichain-based algorithms (see [23] for 
a survey of the antichain approach) for solving the qualitative problems that 
avoid the explicit exponential construction. Thus for the important special
case of player 1 partial and player 2 perfect we establish optimal memory 
bound, complexity bound, and present symbolic algorithmic solutions for the 
qualitative problems. 

2. Player 1 perfect and player 2 partial. 

(a) We show a very surprising result that both for positive and almost-sure 
winning, pure strategies for player 1 require memory of non-elementary 
size (i.e., a tower of exponentials). This is in sharp contrast with (i) the 
case of randomized strategies (with or without actions visible) where 
memoryless strategies are sufficient for positive winning, and with (ii) 
the previous case where player 1 has partial observation and player 2 has 
perfect observation, where pure strategies for positive winning require 
only exponential memory. Surprisingly and perhaps counter-intuitively 
when player 1 has more information and player 2 has less information, 
the positive winning strategies for player 1 require much more memory 
(non-elementary as compared to exponential). With more information 
player 1 can win from more states, but the winning strategy is much 
harder to implement. 



(b) We present a non-elementary upper bound for the memory needed by 
pure strategies for positive winning. We then show with an example that 
for almost-sure winning more memory may be required as compared 
to positive winning. Finally, we show how to combine pure strategies 
for positive winning in a recharging scheme to obtain a non-elementary 
upper bound for the memory required by pure strategies for almost-sure 
winning. Thus we establish non-elementary complete bounds for pure 
strategies both for positive and almost-sure winning. 

3. General (two-sided) case. We show that in the general case finite memory
strategies are sufficient both for positive and almost-sure winning. The result
is obtained essentially by a simple generalization of König's Lemma [29]. The
non-elementary lower bound for memory follows from the special case when 
player 1 has perfect observation and player 2 has partial observation. 

4. Randomized strategies with actions invisible. For randomized strategies with 
actions invisible we present two reductions to establish connections with 
pure strategies. First, we show that the almost-sure problem for randomized 
strategies with actions invisible can be reduced in polynomial time to the 
almost-sure problem for pure strategies. The reduction requires first establishing
that finite-memory randomized strategies are sufficient in two-sided
games. Second, we show that the problem of almost-sure winning with pure 
strategies can be reduced in polynomial time to the problem of random- 
ized strategies with actions invisible. For this reduction it is crucial that the 
actions are not visible. 

Our reductions have deep consequences. They unexpectedly imply that the 
problems of almost-sure winning with pure strategies or randomized strate- 
gies with actions invisible are polynomial-time equivalent. Moreover, it fol- 
lows that even in one-sided games with player 1 partial and player 2 perfect, 
belief-based randomized strategies (with actions invisible) are not sufficient
for almost-sure winning. This shows that the remark (without proof) of [16] 
that the results (such as existence of belief-based strategies) of randomized 
strategies with actions visible carry over to actions invisible is an oversight. 
However from our first reduction and our results for pure strategies it fol- 
lows that there is an exponential upper bound on memory and the problem is 
EXPTIME-complete for one-sided games with player 1 partial and player 2
perfect. More importantly, our results exhibit a serious flaw in the main re- 
sult of [26] which showed that belief-based randomized strategies with actions 
invisible are sufficient for almost-sure winning in two-sided games, and concluded
that enumerating over such strategies yields a 2EXPTIME algorithm
for the problem. Our second reduction and lower bound for pure strategies 
show that the result is incorrect, and that the exponential (belief-based) up- 
per bound is far off. Instead, the lower bound on memory for almost-sure 
winning with randomized strategies and actions invisible is non-elementary. 
Thus, contrary to the general belief, there is a sharp contrast for randomized 
strategies with or without actions visible: if actions are visible, then expo- 
nential memory is sufficient for almost-sure winning while if actions are not 
visible, then memory of non-elementary size is necessary in general. 



The memory requirements are summarized in Table 1 and the results of this 
paper are shown in bold font. We explain how the other results of the table 
follow from results of the literature. For randomized strategies (with or without 
actions visible), if a positive winning strategy exists, then a memoryless strategy
that plays all actions uniformly at random is also positive winning. Thus
the memoryless result for positive winning strategies follows for all cases of ran- 
domized strategies. The belief-based bound for memory of almost-sure winning 
randomized strategies with actions visible follows from [16,7]. The memoryless 
strategies results for almost-sure winning for one-sided games with player 1 per- 
fect and player 2 partial are obtained as follows: when actions are visible, then 
belief-based strategies coincide with memoryless strategies as player 1 has perfect 
observation. If player 1 has perfect observation, then for memoryless strategies 
whether actions are visible or not is irrelevant and thus the memoryless result 
also follows for randomized strategies with actions invisible. Thus along with our
results we obtain Table 1. 
Table 1. Memory requirement for player 1 and reachability objective.



2 Definitions 



A probability distribution on a finite set $S$ is a function $\kappa : S \to [0, 1]$ such that
$\sum_{s \in S} \kappa(s) = 1$. The support of $\kappa$ is the set $\mathrm{Supp}(\kappa) = \{s \in S \mid \kappa(s) > 0\}$. We
denote by $\mathcal{D}(S)$ the set of probability distributions on $S$. Given $s \in S$, the Dirac
distribution on $s$ assigns probability 1 to $s$.

Games. Given finite alphabets $A_i$ of actions for player $i$ ($i = 1, 2$), a stochastic
game on $A_1, A_2$ is a tuple $G = \langle Q, q_0, \delta \rangle$ where $Q$ is a finite set of states, $q_0 \in Q$ is
the initial state, and $\delta : Q \times A_1 \times A_2 \to \mathcal{D}(Q)$ is a probabilistic transition function
that, given a current state $q$ and actions $a, b$ for the players, gives the transition
probability $\delta(q, a, b)(q')$ to the next state $q'$. The game is called deterministic
if $\delta(q, a, b)$ is a Dirac distribution for all $(q, a, b) \in Q \times A_1 \times A_2$. A state $q$ is
absorbing if $\delta(q, a, b)$ is the Dirac distribution on $q$ for all $(a, b) \in A_1 \times A_2$. In
some examples, we allow an initial distribution of states. This can be encoded
in our game model by a probabilistic transition from the initial state.
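
As a rough illustration of the definitions above, a stochastic game can be stored as a table from (state, action of player 1, action of player 2) to a distribution over successor states. The sketch below is only one possible encoding: the class name `StochasticGame`, the helper names, and the two-state toy game are hypothetical and do not come from the paper.

```python
from fractions import Fraction
from itertools import product


def support(dist):
    """Supp(kappa): the elements that receive positive probability."""
    return {s for s, p in dist.items() if p > 0}


def is_dirac(dist):
    """True if the distribution puts probability 1 on a single element."""
    return len(support(dist)) == 1 and sum(dist.values()) == 1


class StochasticGame:
    """G = <Q, q0, delta> over action alphabets A1 and A2 (illustrative encoding)."""

    def __init__(self, states, initial, actions1, actions2, delta):
        self.states, self.initial = set(states), initial
        self.actions1, self.actions2 = set(actions1), set(actions2)
        self.delta = delta  # dict: (q, a, b) -> {q': probability}
        # delta must be total and every entry must be a probability distribution
        for key in product(self.states, self.actions1, self.actions2):
            assert sum(delta[key].values()) == 1, f"not a distribution: {key}"

    def is_deterministic(self):
        return all(is_dirac(d) for d in self.delta.values())

    def is_absorbing(self, q):
        return all(support(self.delta[(q, a, b)]) == {q}
                   for a in self.actions1 for b in self.actions2)


# Hypothetical toy game: from q0, action 'a' reaches 'goal' with probability 1/2.
half, one = Fraction(1, 2), Fraction(1)
delta = {
    ("q0", "a", "x"): {"q0": half, "goal": half},
    ("q0", "b", "x"): {"q0": one},
    ("goal", "a", "x"): {"goal": one},
    ("goal", "b", "x"): {"goal": one},
}
game = StochasticGame({"q0", "goal"}, "q0", {"a", "b"}, {"x"}, delta)
print(game.is_deterministic(), game.is_absorbing("goal"))  # False True
```

Exact fractions are used here so that the check that each entry sums to 1 is not affected by floating-point rounding.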

A player-1 state is a state $q$ where $\delta(q, a, b) = \delta(q, a, b')$ for all $a \in A_1$ and all
$b, b' \in A_2$. We use the notation $\delta(q, a, -)$. Player-2 states are defined analogously.
In figures, we use boxes to emphasize that a state is a player-2 state, and we 
represent probabilistic branches using diamonds (which are not real 'states', e.g., 
as in Fig. 1). 

In a (two-sided) partial-observation game, the players have a partial or incom- 
plete view of the states visited and of the actions played in the game. This view 
may be different for the two players and it is defined by equivalence relations $\approx_i$
on the states and on the actions. For player $i$, equivalent states (or actions) are
indistinguishable. We denote by $\mathcal{O}_i \subseteq 2^Q$ ($i = 1, 2$) the equivalence classes of $\approx_i$
which define two partitions of the state space $Q$, and we call them observations
(for player $i$). These partitions uniquely define functions $\mathrm{obs}_i : Q \to \mathcal{O}_i$ ($i = 1, 2$)
such that $q \in \mathrm{obs}_i(q)$ for all $q \in Q$, that map each state $q$ to its observation for
player $i$.

In the case where all states and actions are equivalent (i.e., the relation $\approx_i$
is the set $(Q \times Q) \cup (A_1 \times A_1) \cup (A_2 \times A_2)$), we say that player $i$ is blind and
the actions are invisible. In this case, we have $\mathcal{O}_i = \{Q\}$ because all states
have the same observation. Note that the case of perfect observation for player $i$
corresponds to the case $\mathcal{O}_i = \{\{q_0\}, \{q_1\}, \ldots, \{q_n\}\}$ (given $Q = \{q_0, q_1, \ldots, q_n\}$),
and $a \approx_i b$ iff $a = b$, for all actions $a, b$.

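For $s \subseteq Q$, $a \in A_1$, and $b \in A_2$, let $\mathrm{Post}_{a,b}(s) = \bigcup_{q \in s} \mathrm{Supp}(\delta(q, a, b))$ denote
the set of possible successors of the states in $s$ given actions $a$ and $b$, and let
$\mathrm{Post}_{a,-}(s) = \bigcup_{b \in A_2} \mathrm{Post}_{a,b}(s)$.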
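
The two successor operators can be computed directly from the transition table. The following is a minimal sketch, assuming the transition function is given as a dictionary from (state, action of player 1, action of player 2) to a distribution; the function names and the small transition fragment are hypothetical.

```python
def post_ab(delta, s, a, b):
    """Post_{a,b}(s): union of Supp(delta(q, a, b)) over all q in s."""
    return {q2 for q in s for q2, p in delta[(q, a, b)].items() if p > 0}


def post_a(delta, actions2, s, a):
    """Post_{a,-}(s): union of Post_{a,b}(s) over all player-2 actions b."""
    result = set()
    for b in actions2:
        result |= post_ab(delta, s, a, b)
    return result


# Hypothetical fragment: player 2 has the two actions 'x' and 'y'.
delta = {
    ("q1", "a", "x"): {"q1": 0.5, "q2": 0.5},
    ("q1", "a", "y"): {"q3": 1.0},
    ("q2", "a", "x"): {"q2": 1.0},
    ("q2", "a", "y"): {"q2": 1.0},
}
print(sorted(post_ab(delta, {"q1", "q2"}, "a", "x")))       # ['q1', 'q2']
print(sorted(post_a(delta, {"x", "y"}, {"q1", "q2"}, "a")))  # ['q1', 'q2', 'q3']
```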

Plays and observations. Initially, the game starts in the initial state $q_0$. In each
round, player 1 chooses an action $a \in A_1$, player 2 (simultaneously and independently)
chooses an action $b \in A_2$, and the successor of the current state $q$
is chosen according to the probabilistic transition function $\delta(q, a, b)$. A play in
$G$ is an infinite sequence $\rho = q_0 a_0 b_0 q_1 a_1 b_1 q_2 \ldots$ such that $q_0$ is the initial state
and $\delta(q_j, a_j, b_j)(q_{j+1}) > 0$ for all $j \geq 0$ (the actions $a_j$'s and $b_j$'s are the actions
associated to the play). Its length is $|\rho| = \infty$. The length of a play prefix
$\rho = q_0 a_0 b_0 q_1 \ldots q_k$ is $|\rho| = k$, and its last element is $\mathrm{Last}(\rho) = q_k$. A state $q \in Q$
is reachable if it occurs in some play. We denote by $\mathrm{Plays}(G)$ the set of plays in
$G$, and by $\mathrm{Prefs}(G)$ the set of corresponding finite prefixes. The observation sequence
for player $i$ ($i = 1, 2$) of a play (prefix) $\rho$ is the unique (in)finite sequence
$\mathrm{obs}_i(\rho) = \gamma_0 \gamma_1 \ldots$ such that $q_j \in \gamma_j \in \mathcal{O}_i$ for all $0 \leq j \leq |\rho|$.
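
To make the observation sequence concrete, the sketch below builds the observation map of a player from a partition of the state space and applies it to the states of a play prefix. It assumes that actions are invisible, so only states contribute to the sequence; the names `make_obs` and `observation_sequence` and the example partition are hypothetical.

```python
def make_obs(partition):
    """Build obs_i : Q -> O_i from a partition of Q into observations."""
    obs = {}
    for block in partition:
        block = frozenset(block)
        for q in block:
            assert q not in obs, "observations must be pairwise disjoint"
            obs[q] = block
    return obs


def observation_sequence(obs, prefix_states):
    """obs_i(rho) = gamma_0 gamma_1 ... where q_j belongs to gamma_j."""
    return [obs[q] for q in prefix_states]


# Hypothetical partition: the player cannot tell q1 from q2.
obs1 = make_obs([{"q0"}, {"q1", "q2"}, {"goal"}])
same = (observation_sequence(obs1, ["q0", "q1"])
        == observation_sequence(obs1, ["q0", "q2"]))
print(same)  # True: the two prefixes are indistinguishable for this player
```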

The games with one-sided partial-observation are the special case where either
$\approx_1$ is equality and hence $\mathcal{O}_1 = \{\{q\} \mid q \in Q\}$ (player 1 has complete
observation) or $\approx_2$ is equality and hence $\mathcal{O}_2 = \{\{q\} \mid q \in Q\}$ (player 2 has
complete observation). The games with perfect observation are the special cases
where $\approx_1$ and $\approx_2$ are equality, i.e., every state and action is visible to both
players.



Strategies. A pure strategy in $G$ for player 1 is a function $\sigma : \mathrm{Prefs}(G) \to A_1$.
A randomized strategy in $G$ for player 1 is a function $\sigma : \mathrm{Prefs}(G) \to \mathcal{D}(A_1)$.
A (pure or randomized) strategy $\sigma$ for player 1 is observation-based if for all
prefixes $\rho = q_0 a_0 b_0 q_1 \ldots$ and $\rho' = q'_0 a'_0 b'_0 q'_1 \ldots$, if $a_j \approx_1 a'_j$ and $b_j \approx_1 b'_j$ for all
$j \geq 0$, and $\mathrm{obs}_1(\rho) = \mathrm{obs}_1(\rho')$, then $\sigma(\rho) = \sigma(\rho')$. It is assumed that strategies
are observation-based in partial-observation games. If for all actions $a$ and $b$ we
have $a \approx_1 b$ and $a \approx_2 b$ iff $a = b$ (all actions are distinguishable), then the
strategy is action visible, and if for all actions $a$ and $b$ we have $a \approx_1 b$ and $a \approx_2 b$
(all actions are indistinguishable), then the strategy is action invisible. We say
that a play (prefix) $\rho = q_0 a_0 b_0 q_1 \ldots$ is compatible with a pure (resp., randomized)
strategy $\sigma$ if the associated action of player 1 in step $j$ is $a_j = \sigma(q_0 a_0 b_0 \ldots q_{j-1})$
(resp., $a_j \in \mathrm{Supp}(\sigma(q_0 a_0 b_0 \ldots q_{j-1}))$) for all $0 \leq j \leq |\rho|$.

We omit analogous definitions of strategies for player 2. We denote by $\Sigma_G$,
$\Sigma_G^O$, $\Sigma_G^P$, $\Pi_G$, $\Pi_G^O$, and $\Pi_G^P$ the set of all player-1 strategies, the set of all
observation-based player-1 strategies, the set of all pure player-1 strategies, the
set of all player-2 strategies in $G$, the set of all observation-based player-2 strategies,
and the set of all pure player-2 strategies, respectively.

Remarks. 

1. The model of games with partial observation on both actions and states can 
be encoded in a model of games with actions invisible and observations on 
states only: when actions are invisible, we can use the state space to keep 
track of the last action played, and reveal information about the last action 
played using observations on the states. Therefore, in the sequel we assume 
that the actions are invisible to the players with partial observation. A play 
is then viewed as a sequence of states only, and the definition of strategies is 
updated accordingly. Note that a player with perfect observation has actions 
and states visible (and the equivalence relation «i is equality). 

2. The important special case of partial-observation Markov decision processes
(POMDP) corresponds to the case where either all states in the game are
player-1 states (player-1 POMDP) or all states are player-2 states (player-2
POMDP). For POMDP it is known that randomization is not necessary, and
pure strategies are as powerful as randomized strategies [14].

Finite-memory strategies. A player-1 strategy uses finite-memory if it can be
encoded by a deterministic transducer $\langle \mathrm{Mem}, m_0, \sigma_u, \sigma_n \rangle$ where $\mathrm{Mem}$ is a finite
set (the memory of the strategy), $m_0 \in \mathrm{Mem}$ is the initial memory value,
$\sigma_u : \mathrm{Mem} \times \mathcal{O}_1 \to \mathrm{Mem}$ is an update function, and $\sigma_n : \mathrm{Mem} \times \mathcal{O}_1 \to \mathcal{D}(A_1)$ is a
next-move function. The size of the strategy is the number $|\mathrm{Mem}|$ of memory values. If
the current observation is $o$, and the current memory value is $m$, then the strategy
chooses the next action according to the probability distribution $\sigma_n(m, o)$, and
the memory is updated to $\sigma_u(m, o)$. Formally, $\langle \mathrm{Mem}, m_0, \sigma_u, \sigma_n \rangle$ defines the
strategy $\sigma$ such that $\sigma(\rho \cdot q) = \sigma_n(\hat{\sigma}_u(m_0, \mathrm{obs}_1(\rho)), \mathrm{obs}_1(q))$ for all $\rho \in Q^*$
and $q \in Q$, where $\hat{\sigma}_u$ extends $\sigma_u$ to sequences of observations as expected. This
definition extends to infinite-memory strategies by dropping the assumption that
the set $\mathrm{Mem}$ is finite. A strategy is memoryless if $|\mathrm{Mem}| = 1$. For a strategy $\sigma$,
we denote by $G_\sigma$ the player-2 POMDP obtained as the synchronous product
of $G$ with the transducer defining $\sigma$.
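
The transducer view of a strategy translates almost literally into code. The following is a minimal sketch, assuming observations are hashable values and distributions are dictionaries from actions to probabilities; the class name `FiniteMemoryStrategy` and the two-memory-value example are hypothetical.

```python
class FiniteMemoryStrategy:
    """Transducer <Mem, m0, sigma_u, sigma_n> encoding a player-1 strategy."""

    def __init__(self, memory, m0, update, next_move):
        self.memory = memory        # finite set Mem
        self.m0 = m0                # initial memory value
        self.update = update        # dict: (m, o) -> m'            (sigma_u)
        self.next_move = next_move  # dict: (m, o) -> {action: p}  (sigma_n)

    def size(self):
        return len(self.memory)

    def distribution(self, observations):
        """sigma(rho . q): fold sigma_u over obs(rho), then apply sigma_n to obs(q)."""
        *history, current = observations
        m = self.m0
        for o in history:
            m = self.update[(m, o)]
        return self.next_move[(m, current)]


# Hypothetical blind player (single observation "o") that alternates a and b.
alternate = FiniteMemoryStrategy(
    memory={0, 1}, m0=0,
    update={(0, "o"): 1, (1, "o"): 0},
    next_move={(0, "o"): {"a": 1.0}, (1, "o"): {"b": 1.0}},
)
print([alternate.distribution(["o"] * k) for k in (1, 2, 3)])
# [{'a': 1.0}, {'b': 1.0}, {'a': 1.0}]
```

The example strategy is pure (every next-move distribution is Dirac) and has size 2; a memoryless strategy would use a single memory value.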



Objectives and winning modes. An objective (for player 1) in $G$ is a set $\varphi \subseteq
\mathrm{Plays}(G)$ of plays. A play $\rho \in \mathrm{Plays}(G)$ satisfies the objective $\varphi$, denoted $\rho \models \varphi$,
if $\rho \in \varphi$. Objectives are generally Borel measurable: a Borel objective is a Borel
set in the Cantor topology [28]. Given strategies $\sigma$ and $\pi$ for the two players,
the probability of a measurable objective $\varphi$ is uniquely defined [44]. We denote
by $\mathrm{Pr}_{q_0}^{\sigma,\pi}(\varphi)$ the probability that $\varphi$ is satisfied by the play obtained from the
starting state $q_0$ when the strategies $\sigma$ and $\pi$ are used.

We specifically consider the following objectives. Given a set $T \subseteq Q$ of
target states, the reachability objective requires that the play visit the set $T$:
$\mathrm{Reach}(T) = \{q_0 a_0 b_0 q_1 \ldots \in \mathrm{Plays}(G) \mid \exists i \geq 0 : q_i \in T\}$, and the Büchi objective
requires that the play visit the set $T$ infinitely often, $\mathrm{Buchi}(T) = \{q_0 a_0 b_0 q_1 \ldots \in
\mathrm{Plays}(G) \mid \forall i \geq 0 \cdot \exists j \geq i : q_j \in T\}$. Our solution for reachability objectives will
also use the dual notion of safety objectives that require the play to stay within
the set $T$: $\mathrm{Safe}(T) = \{q_0 a_0 b_0 q_1 \ldots \in \mathrm{Plays}(G) \mid \forall i \geq 0 : q_i \in T\}$. In figures, the
target states in $T$ are double-lined and labeled by ©.

Given a game structure $G$ and a state $q$, an observation-based strategy $\sigma$
for player 1 is almost-sure winning (resp. positive winning) for the objective $\varphi$
from $q$ if for all observation-based randomized strategies $\pi$ for player 2, we have
$\mathrm{Pr}_{q}^{\sigma,\pi}(\varphi) = 1$ (resp. $\mathrm{Pr}_{q}^{\sigma,\pi}(\varphi) > 0$). The strategy $\sigma$ is sure winning if all plays
compatible with $\sigma$ satisfy $\varphi$. We also say that the state $q$ is almost-sure (or
positive, or sure) winning for player 1.



Positive and almost-sure winning problems. We are interested in the problems of
deciding, given a game structure $G$, a state $q$, and an objective $\varphi$, whether there
exists a {pure, randomized} strategy which is {almost-sure, positive} winning
from $q$ for the objective $\varphi$. For safety objectives almost-sure winning coincides
with sure winning, however for reachability objectives they are different. The sure
winning problem for the objectives we consider has been studied in [36, 16, 13].
The almost-sure winning problem for Büchi objectives can be easily reduced to
the almost-sure winning problem for reachability objectives [4], and the reduction
is as follows: given a two-sided stochastic game with Büchi objective $\mathrm{Buchi}(T)$,
we add an absorbing state $q_T$, make $q_T$ the target state for the reachability
objective, and from every state $q \in T$ we add positive probability transitions
to $q_T$ (details and correctness proof follow from [4, Lemma 13]). The positive
winning problem for Büchi objectives is undecidable even for POMDPs [4]. Hence
in this paper we only focus on reachability objectives. In all our analysis, the
counter strategies of player 2 can be restricted to pure strategies, because once
a strategy for player 1 is fixed, then we obtain a POMDP for player 2 in which
pure strategies are as powerful as randomized strategies [14].
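
The reduction from Büchi to reachability objectives described above can be sketched as a transformation of the transition table. The text only requires that the new transitions to the added state have positive probability; the concrete choice below (divert probability 1/2 from every target state and scale the original distribution by the remaining 1/2) is an assumption made for illustration, as are the function name and the state name `q_T`. The observation assigned to the new state is not modeled here.

```python
from fractions import Fraction


def buchi_to_reachability(states, actions1, actions2, delta, targets,
                          sink="q_T", p=Fraction(1, 2)):
    """Return (states', delta', {sink}) for the reachability game.

    Assumption: from each state in `targets`, probability p goes to the new
    absorbing state `sink`; the original distribution is scaled by 1 - p.
    """
    assert sink not in states and 0 < p < 1
    new_delta = {}
    for (q, a, b), dist in delta.items():
        if q in targets:
            scaled = {q2: (1 - p) * pr for q2, pr in dist.items()}
            scaled[sink] = p
            new_delta[(q, a, b)] = scaled
        else:
            new_delta[(q, a, b)] = dict(dist)
    for a in actions1:
        for b in actions2:
            new_delta[(sink, a, b)] = {sink: Fraction(1)}  # sink is absorbing
    return set(states) | {sink}, new_delta, {sink}


# Hypothetical two-state game that alternates between u and the Buchi state t.
one = Fraction(1)
delta = {("u", "a", "x"): {"t": one}, ("t", "a", "x"): {"u": one}}
states2, delta2, target2 = buchi_to_reachability(
    {"u", "t"}, {"a"}, {"x"}, delta, targets={"t"})
print(delta2[("t", "a", "x")])
```

In this toy game every visit to `t` moves to `q_T` with probability 1/2, so a play that visits `t` infinitely often reaches `q_T` with probability 1.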
Fig. 1. Belief-only is not enough for positive (as well as almost-sure) reachability.
A one-sided game with reachability objective in which player 1 is
blind and player 2 has perfect observation. If we consider pure strategies, then player 1
has a positive (as well as almost-sure) winning strategy, but there is no belief-based
memoryless positive winning strategy.



3 One-sided Games: Player 1 Partial and Player 2 Perfect 

In Sections 3 and 4, we consider one-sided games with partial observation: one 
player has perfect observation, and the other player has partial observation. The 
player with perfect observation sees the states visited and the actions played 
in the game. We present the results for positive and almost-sure winning for 
reachability objectives along with examples that illustrate key elements of the 
problem such as the memory required for winning strategies. 

Note that the case of player 1 partial and player 2 perfect is important in 
the context of controller synthesis as it is a conservative approximation of two- 
sided games for player 1 (if player 1 wins in the one-sided game, then he also 
wins in the two-sided game). In the following example we show that for pure 
strategies belief-based strategies are not sufficient for positive as well as almost- 
sure winning. A strategy is belief-based if its memory relies only on the subset 
construction, i.e., the strategy plays only depending on the set of possible current 
states of the game which is called belief. 

Example 1. Belief-only is not enough for positive (as well as almost-sure)
reachability. Consider the game in Fig. 1 where player 1 is blind (all
states have the same observation except the target state, and actions are invisible)
and player 2 has perfect observation. Initially, player 2 chooses the state $q_1$
or $q_2$ (which player 1 does not see). The belief of player 1 is thus the set $\{q_1, q_2\}$
(see Fig. 2). We claim that the belief is not sufficient information to win with a
pure strategy for player 1 because the belief-based subset construction in Fig. 2
suggests that always playing the same action (say $a$) when the belief is $\{q_1, q_2\}$
is an almost-sure winning strategy. However, in the original game this is not
even a positive winning strategy (the counter strategy of player 2 is to choose $q_2$
initially). A winning strategy for player 1 is to alternate between $a$ and $b$ when
the belief is $\{q_1, q_2\}$, which requires remembering more than the belief set. ■

Fig. 2. The belief-based subset construction for the reachability game of Fig. 1. Player 1
has a pure strategy for positive (as well as almost-sure) winning in the subset construction.
However, belief-based memoryless pure strategies are not sufficient in the original
game.
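
The phenomenon of Example 1 can be reproduced numerically on a small stand-in game. The game below is hypothetical and is not a transcription of Fig. 1: it is merely one blind game that matches the description (player 2 picks $q_1$ or $q_2$ once, the belief stays $\{q_1, q_2\}$ as long as the target is not observed, always playing $a$ fails, alternating succeeds). Since player 2 has no further choices in this stand-in, the transition table depends on the action of player 1 only.

```python
from fractions import Fraction

half, one = Fraction(1, 2), Fraction(1)
# Hypothetical blind game: 'a' helps only in q1, 'b' helps only in q2.
delta = {
    ("q1", "a"): {"goal": half, "q1": half},
    ("q1", "b"): {"q1": one},
    ("q2", "a"): {"q2": one},
    ("q2", "b"): {"goal": half, "q2": half},
    ("goal", "a"): {"goal": one},
    ("goal", "b"): {"goal": one},
}


def reach_probability(start, actions):
    """Probability that 'goal' has been visited after playing the action word."""
    dist = {start: one}
    for a in actions:
        nxt = {}
        for q, p in dist.items():
            for q2, pr in delta[(q, a)].items():
                nxt[q2] = nxt.get(q2, 0) + p * pr
        dist = nxt
    return dist.get("goal", Fraction(0))  # 'goal' is absorbing


def belief_after(belief, a):
    """Belief of player 1 after action a when the target was not observed."""
    return {q2 for q in belief for q2 in delta[(q, a)]} - {"goal"}


always_a, alternate = "a" * 10, "ab" * 5
for start in ("q1", "q2"):
    print(start, reach_probability(start, always_a),
          reach_probability(start, alternate))
# q1 1023/1024 31/32
# q2 0 31/32
```

In this stand-in the belief is $\{q_1, q_2\}$ after every action, so a belief-based memoryless pure strategy must repeat one action forever and is defeated by the matching initial choice of player 2, whereas the alternating word reaches the target with probability tending to 1 from both states.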

We present reductions of the almost-sure and positive winning problem for
reachability objective to the problem of sure-winning in a game of perfect observation
with Büchi objective, and reachability objective respectively. The two
reductions are based on the same construction of a game where the state space
$L = \{(s, o) \mid o \subseteq s \subseteq Q\}$ contains the subset construction $s$ enriched with obligation
sets $o \subseteq s$ which ensure that from all states in $s$, the target set $T$ is reached
with positive probability.

Lemma 1. Given a one-sided partial-observation stochastic game G with 
player 1 partial and player 2 perfect with a reachability objective for player 1, 
we can construct in time exponential in the size of the game and polynomial in 
the size of action sets a perfect-information deterministic game H with a Büchi
objective (resp. reachability objective) such that player 1 has a pure almost-sure 
(resp. positive) winning strategy in G iff player 1 has a sure-winning strategy in 
H. 

Proof. We present the construction and the proof in detail for almost-sure reachability.
The construction is the same for positive reachability, and the argument
is described succinctly afterwards. 

Construction. Given $G = (Q, q_0, \delta)$ over alphabets $A_1, A_2$ and observation set
$\mathcal{O}_1$ for player 1, with reachability objective $\mathsf{Reach}(T)$, we construct the
following (deterministic) game of perfect observation $H = (L, \ell_0, \delta_H)$ over alphabets
$A'_1, A'_2$ with Büchi objective $\mathsf{Buchi}(\alpha)$ defined by $\alpha \subseteq L$ where:

- $L = \{(s,o) \mid o \subseteq s \subseteq Q\}$. Intuitively, $s$ is the belief of player 1 and $o$ is a set
  of obligation states that "owe" a visit to $T$ with positive probability;

- $\ell_0 = (\{q_0\}, \{q_0\})$ if $q_0 \notin T$, and $\ell_0 = (\emptyset, \emptyset)$ if $q_0 \in T$;

- $A'_1 = A_1 \times 2^Q$. In a pair $(a,u) \in A'_1$, we call $a$ the action, and $u$ the witness
  set;

- $A'_2 = \mathcal{O}_1$. In the game $H$, player 2 simulates player 2's choice in game $G$, and
  also resolves the probabilistic choices. This amounts to choosing a possible
  successor state, and revealing its observation;

- $\alpha = \{(s, \emptyset) \in L\}$;

- $\delta_H$ is defined as follows. First, the state $(\emptyset, \emptyset)$ is absorbing. Second, in every
  other state $(s,o) \in L$ the function $\delta_H$ ensures that (i) player 1 chooses a
  pair $(a,u)$ such that $\mathsf{Supp}(\delta(q,a,b)) \cap u \neq \emptyset$ for all $q \in o$ and $b \in A_2$, and
  (ii) player 2 chooses an observation $\gamma \in \mathcal{O}_1$ such that $\mathsf{Post}_{a,-}(s) \cap \gamma \neq \emptyset$.
  If a player violates this, then a losing absorbing state is reached with
  probability 1. Assuming the above condition on $(a,u)$ and $\gamma$ is satisfied,
  define $\delta_H((s,o),(a,u),\gamma)$ as the Dirac distribution on the state $(s',o')$ such
  that:

  • $s' = (\mathsf{Post}_{a,-}(s) \cap \gamma) \setminus T$;

  • $o' = s'$ if $o = \emptyset$; and $o' = (\mathsf{Post}_{a,-}(o) \cap \gamma \cap u) \setminus T$ if $o \neq \emptyset$.

Note that for every reachable state $(s,o)$ in $H$, there exists a unique
observation $\gamma \in \mathcal{O}_1$ such that $s \subseteq \gamma$ (which we denote by $\mathsf{obs}_1(s)$).
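
The update of a pair (belief, obligation) can be illustrated by a small sketch. This is not part of the construction itself: the encoding of the transition function as a dictionary `delta[(q, a, b)]` from successor states to probabilities, the representation of an observation as the set of states it contains, and the names `post` and `step_h` are assumptions made for illustration. The losing absorbing state reached after a violation is not modelled; the function returns `None` instead.

```python
from typing import Dict, FrozenSet, Optional, Set, Tuple

State = str
# Assumed encoding: delta[(q, a, b)] maps each successor to its non-zero probability.
Delta = Dict[Tuple[State, str, str], Dict[State, float]]
Pair = Tuple[FrozenSet[State], FrozenSet[State]]


def post(delta: Delta, states: Set[State], a: str, actions2: Set[str]) -> Set[State]:
    """Successors of `states` under player-1 action a and any player-2 action."""
    return {q2 for q in states for b in actions2 for q2 in delta[(q, a, b)]}


def step_h(delta: Delta, actions2: Set[str], target: Set[State],
           s: Set[State], o: Set[State], a: str, u: Set[State],
           gamma: Set[State]) -> Optional[Pair]:
    """Successor (s', o') of (s, o) under the pair (a, u) and observation gamma.

    Returns None if one of the two side conditions on (a, u) and gamma fails.
    """
    if not s:
        return (frozenset(), frozenset())  # the pair (empty, empty) is absorbing
    # (i) the witness set must meet the support from every obligation state
    if any(not (set(delta[(q, a, b)]) & u) for q in o for b in actions2):
        return None
    # (ii) the observation must be compatible with some successor of the belief
    reachable = post(delta, s, a, actions2) & gamma
    if not reachable:
        return None
    s_next = reachable - target
    if o:
        o_next = (post(delta, o, a, actions2) & gamma & u) - target
    else:
        o_next = s_next  # obligation was discharged: restart from the new belief
    return (frozenset(s_next), frozenset(o_next))
```

In this sketch a pair with empty obligation corresponds to a visit to the Büchi set, after which the obligation is reset to the whole belief.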

We show the following property of this construction. Player 1 has a pure
observation-based almost-sure winning strategy in $G$ for the objective $\mathsf{Reach}(T)$
if and only if player 1 has a sure winning strategy in $H$ for the objective $\mathsf{Buchi}(\alpha)$.

Mapping of plays. Given a play prefix $\rho_H = (s_0,o_0)(s_1,o_1) \dots (s_k,o_k)$ in $H$ with
associated actions for player 1 of the form $(a_i, \cdot)$ in step $i$ ($0 \le i < k$), and a play
prefix $\rho_G = q_0 q_1 \dots q_k$ in $G$ with associated actions $a'_i$ ($0 \le i < k$) for player 1,
we say that $\rho_G$ is matching $\rho_H$ if $a_i = a'_i$ for all $0 \le i < k$, and $q_i \in \mathsf{obs}_1(s_i)$ for
all $0 \le i \le k$.

By induction on the length of $\rho_H$, we show that (i) for each $q_k \in s_k$ there
exists a matching play $\rho_G$ (which visits no $T$-state) such that $\mathsf{Last}(\rho_G) = q_k$,
and (ii) for all play prefixes $\rho_G$ matching $\rho_H$, if $\rho_G$ does not visit any $T$-state,
then $\mathsf{Last}(\rho_G) \in s_k$.

For $|\rho_H| = 0$ (i.e., $\rho_H = (s_0,o_0)$ where $(s_0,o_0) = \ell_0$) it is easy to see
that $\rho_G = q_0$ is a matching play with $q_0 \notin T$ if and only if $s_0 = o_0 = \{q_0\}$.
For the induction step, assume that we have constructed matching plays for
all play prefixes of length $k-1$, and let $\rho_H = (s_0,o_0)(s_1,o_1) \dots (s_k,o_k)$ be
a play prefix of length $k$ in $H$ with associated actions of the form $(a_i, \cdot)$ in
step $i$ ($0 \le i < k$). To prove (i), pick $q_k \in s_k$. By definition of $\delta_H$, we have
$q_k \in \mathsf{Post}_{a_{k-1},-}(s_{k-1})$, hence there exist $b \in A_2$ and $q_{k-1} \in s_{k-1}$ such that
$q_k \in \mathsf{Supp}(\delta(q_{k-1}, a_{k-1}, b))$. By the induction hypothesis, there exists a play prefix
$\rho_G$ in $G$ matching $(s_0,o_0) \dots (s_{k-1},o_{k-1})$ and with $\mathsf{Last}(\rho_G) = q_{k-1}$, which we
can extend to $\rho_G \cdot q_k$ to obtain a play prefix matching $\rho_H$. To prove (ii), it is
easy to see that every play prefix matching $\rho_H$ is an extension of a play prefix
matching $(s_0,o_0) \dots (s_{k-1},o_{k-1})$ with a non-$T$-state $q_k$ in $\gamma_k = \mathsf{obs}_1(s_k)$ and in
$\mathsf{Post}_{a_{k-1},-}(s_{k-1})$, therefore $q_k \in (\mathsf{Post}_{a_{k-1},-}(s_{k-1}) \cap \gamma_k) \setminus T = s_k$.

Mapping of strategies, from $G$ to $H$ (ranking argument). First, assume that
player 1 has a pure observation-based almost-sure winning strategy $\sigma$ in $G$ for
the objective $\mathsf{Reach}(T)$. We construct an infinite-state MDP $G_\sigma = (Q^+, \rho_0, \delta_\sigma)$
where:

- $Q^+$ is the set of nonempty finite sequences of states;

- $\rho_0 = q_0 \in Q$;

- $\delta_\sigma : Q^+ \times A_2 \to \mathcal{D}(Q^+)$ is defined as follows: for each $\rho \in Q^+$ and $b \in A_2$,
  if $\mathsf{Last}(\rho) \notin T$ then $\delta_\sigma(\rho, b)$ assigns probability $\delta(\mathsf{Last}(\rho), \sigma(\rho), b)(q')$ to each
  $\rho' = \rho q' \in Q^+$, and probability 0 to all other $\rho' \in Q^+$; if $\mathsf{Last}(\rho) \in T$, then
  $\rho$ is an absorbing state.

We define a ranking of the reachable states of $G_\sigma$. Assign rank 0 to all $\rho \in Q^+$
such that $\mathsf{Last}(\rho) \in T$. For $i = 1, 2, \dots$ assign rank $i$ to all non-ranked $\rho$ such
that for all player 2 actions $b \in A_2$, there exists $\rho' \in \mathsf{Supp}(\delta_\sigma(\rho, b))$ with a rank
(and thus with a rank smaller than $i$). We claim that all reachable states of $G_\sigma$
get a rank. By contradiction, assume that a reachable state $\hat{\rho} = q_0 q_1 \dots q_k$ is
not ranked (note that $q_i \notin T$ for each $0 \le i \le k$). Fix a strategy $\pi$ for player 2
as follows. Since $\hat{\rho}$ is reachable in $G_\sigma$, there exist actions $b_0, \dots, b_{k-1}$ such that
$q_{i+1} \in \mathsf{Supp}(\delta_\sigma(q_0 \dots q_i, b_i))$ for all $0 \le i < k$. Then, define $\pi(q_0 \dots q_i) = b_i$. This
ensures that $\mathsf{Last}(\hat{\rho})$ is reached with positive probability in $G$ under strategies $\sigma$
and $\pi$. From $\hat{\rho}$, the strategy $\pi$ continues playing as follows. If the current state
$\rho$ is not ranked (which is the case of $\hat{\rho}$), then choose an action $b$ such that all
states in $\mathsf{Supp}(\delta_\sigma(\rho, b))$ are not ranked. The fact that $\rho$ is not ranked ensures
that such an action $b$ exists. Now, under $\sigma$ and $\pi$ all paths from $\mathsf{Last}(\hat{\rho})$ in $G$
avoid $T$-states. Hence the set $T$ is not reached almost-surely, in contradiction
with the fact that $\sigma$ is almost-sure winning. Hence all states in $G_\sigma$ get a rank.
We denote by $\mathsf{Rank}(\rho)$ the rank of a reachable state $\rho$ in $G_\sigma$.
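
The ranking is a layered backward computation from the target. The MDP of the proof is infinite-state, so the following is only a sketch on a finite MDP, given (as an assumption of this illustration) by a dictionary `supp[(p, b)]` of successor sets; the function name `compute_ranks` is hypothetical.

```python
def compute_ranks(states, actions2, supp, target):
    """Rank 0 for target states; rank i for unranked states from which every
    player-2 action has at least one already-ranked successor."""
    rank = {p: 0 for p in states if p in target}
    i = 1
    while True:
        # States ranked in this round are determined before any rank is assigned,
        # so each of them has, for every action, a successor of rank < i.
        newly = [p for p in states if p not in rank
                 and all(any(p2 in rank for p2 in supp[(p, b)]) for b in actions2)]
        if not newly:
            return rank  # states missing from the result are not ranked
        for p in newly:
            rank[p] = i
        i += 1
```

A state left without a rank is one from which player 2 has an action whose successors are all unranked, which is the situation used in the contradiction argument above.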

From the strategy $\sigma$ and the ranking in $G_\sigma$, we construct a strategy $\sigma'$ in the
game $H$ as follows. Given a play $\rho_H = (s_0,o_0)(s_1,o_1) \dots (s_k,o_k)$ in $H$ (with $s_k \neq \emptyset$),
define $\sigma'(\rho_H) = (a, u)$ where $a = \sigma(\rho_G)$ for a play prefix $\rho_G$ matching $\rho_H$ and
$u = \{q \in \mathsf{Supp}(\delta(\mathsf{Last}(\rho_G), a, b)) \mid b \in A_2$, $\rho_G$ is matching $\rho_H$ with $\mathsf{Last}(\rho_G) \in o_k$
and $\mathsf{Rank}(\rho_G \cdot q) < \mathsf{Rank}(\rho_G)\}$ is a witness set which selects successor states
of $o_k$ with decreased rank along each branch of the MDP $G_\sigma$.

Note that all matching play prefixes $\rho_G$ have the same observation sequence.
Therefore, the action $a = \sigma(\rho_G)$ is unique and well-defined since $\sigma$ is an
observation-based strategy. Note also that the pair $(a, u)$ is an allowed choice
for player 1 by definition of the ranking, and that for each $q \in o_k$, all matching
play prefixes $\rho_G$ with $\mathsf{Last}(\rho_G) = q$ have the same rank in $G_\sigma$. Therefore
we abuse notation and write $\mathsf{Rank}(q)$ for $\mathsf{Rank}(\rho_G)$, assuming that the set $o_k$ to
which $q$ belongs is clear from the context. Let $\mathsf{MaxRank}(o_k) = \max_{q \in o_k} \mathsf{Rank}(q)$.
If $o_k \neq \emptyset$, then $\mathsf{MaxRank}(o_{k+1}) < \mathsf{MaxRank}(o_k)$ since $o_{k+1} \subseteq u$ (by definition of
$\delta_H$).

Correctness of the mapping. We show that $\sigma'$ is sure winning for $\mathsf{Buchi}(\alpha)$ in $H$.
Fix an arbitrary strategy $\pi'$ for player 2 in $H$ and consider an arbitrary play
$\rho_H = (s_0,o_0)(s_1,o_1) \dots$ compatible with $\sigma'$ and $\pi'$. By the properties of the
witness set played by $\sigma'$, for each pair $(s_i,o_i)$ with $o_i \neq \emptyset$, an $\alpha$-pair $(\cdot, \emptyset)$ is
reached within at most $\mathsf{MaxRank}(o_i)$ steps. And by the properties of the mapping
of plays and strategies, if $o_i = \emptyset$ then $o_{i+1} = s_{i+1}$ contains only states from which
$\sigma$ is almost-sure winning for $\mathsf{Reach}(T)$ in $G$ and which therefore have a finite rank,
showing that $\mathsf{MaxRank}(o_{i+1})$ is defined and finite. This shows that an $\alpha$-pair is
visited infinitely often in $\rho_H$ and $\sigma'$ is sure winning for $\mathsf{Buchi}(\alpha)$.

Mapping of strategies, from $H$ to $G$. Given a strategy $\sigma'$ in $H$, we construct a
pure observation-based strategy $\sigma$ in $G$.

We define $\sigma(\rho_G)$ by induction on the length of $\rho_G$. In fact, we need to define
$\sigma(\rho_G)$ only for play prefixes $\rho_G$ which are compatible with the choices of $\sigma$ for
play prefixes of length smaller than $|\rho_G|$ (the choice of $\sigma$ for other play prefixes
can be fixed arbitrarily). For all such $\rho_G$, our construction is such that there
exists a play prefix $\rho_H = \theta(\rho_G)$ compatible with $\sigma'$ such that $\rho_G$ is matching
$\rho_H$, and if $\sigma(\rho_G) = a$ and $\sigma'(\rho_H) = (a', \cdot)$, then $a = a'$ ($\star$).

We define $\sigma$ and $\theta(\cdot)$ as follows. For $|\rho_G| = 0$ (i.e., $\rho_G = q_0$), let $\rho_H = \theta(\rho_G) =
(s_0, o_0)$ where $s_0 = o_0 = \{q_0\}$ if $q_0 \notin T$, and $s_0 = o_0 = \emptyset$ if $q_0 \in T$, and let
$\sigma(\rho_G) = a$ if $\sigma'(\rho_H) = (a, \cdot)$. Note that property ($\star$) holds. For the induction step,
let $k \ge 1$ and assume that for every play prefix $\rho_G$ of length smaller than $k$, we
have defined $\sigma(\rho_G)$ and $\theta(\rho_G)$ satisfying ($\star$). Let $\rho_G = q_0 q_1 \dots q_k$ be a play prefix
in $G$ of length $k$. Let $\rho_H = \theta(q_0 q_1 \dots q_{k-1})$ and $\gamma_k = \mathsf{obs}_1(q_k)$, and let $(s_k, o_k)$ be
the (unique) successor state in the Dirac distribution $\delta_H(\mathsf{Last}(\rho_H), \sigma'(\rho_H), \gamma_k)$.
Note that $q_k \in s_k$. Define $\theta(\rho_G) = \rho_H \cdot (s_k, o_k)$ and $\sigma(\rho_G) = a$ if $\sigma'(\rho_H \cdot (s_k, o_k)) =
(a, \cdot)$. Therefore, the property ($\star$) holds.

Note that the strategy $\sigma$ is observation-based because if $\mathsf{obs}_1(\rho_G) = \mathsf{obs}_1(\rho'_G)$,
then $\theta(\rho_G) = \theta(\rho'_G)$.

Correctness of the mapping. If player 1 has a sure winning strategy $\sigma'$ in $H$
for the objective $\mathsf{Buchi}(\alpha)$, then we can assume that $\sigma'$ is memoryless (since
in perfect-observation deterministic games with Büchi objectives memoryless
strategies are sufficient for sure winning [24,42]), and we show that the strategy
$\sigma$ defined above is almost-sure winning in $G$ for the objective $\mathsf{Reach}(T)$.

Since $\sigma'$ is memoryless and sure winning for $\mathsf{Buchi}(\alpha)$, in every play compatible
with $\sigma'$ there are at most $n = |L| \le 3^{|Q|}$ steps between two consecutive visits
to an $\alpha$-state.

The properties of matching plays entail that if a play prefix $\rho_G$ compatible
with $\sigma$ has no visit to $T$-states, and $(s,o) = \mathsf{Last}(\theta(\rho_G))$, then $\mathsf{Last}(\rho_G) \in s$.
Moreover if $s = o$, then under strategy $\sigma$ for player 1 and an arbitrary strategy $\pi$
for player 2, there is a way to fix the probabilistic choices such that all play
extensions of $\rho_G$ visit a $T$-state. To see this, consider the probabilistic choices
given at each step by the witness component $u$ of the action $(\cdot, u)$ played by $\sigma'$.
By the definition of the mapping of plays and of the transition function in $H$,
it can be shown that if $(s_i,o_i)(s_{i+1},o_{i+1}) \dots (s_k,o_k)$ is a play fragment of $\theta(\rho_G)$
(hence compatible with $\sigma'$) where $s_i = o_i$ and $o_j \neq \emptyset$ for all $i \le j \le k$, then the
"owe" set $o_k$ is the set of all states that can be reached in $G$ from states of $s_i$ along a
path which is compatible with both the actions played by the strategy $\sigma'$ (and $\sigma$)
and the probabilistic choices fixed by $\sigma'$, and visits no $T$-states. Since the "owe"
set becomes empty within at most $n$ steps regardless of the strategy of player 2,
all paths compatible with the probabilistic choices must visit a $T$-state. This
shows that under any player 2 strategy, within $n$ steps a $T$-state is visited with
probability at least $r^n$ where $r > 0$ is the smallest non-zero probability occurring
in $G$. Therefore, the probability of not having visited a $T$-state after $z \cdot n$ steps
is at most $(1 - r^n)^z$, which vanishes for $z \to \infty$ since $r^n > 0$. Hence, against
an arbitrary strategy of player 2, the strategy $\sigma$ ensures the objective $\mathsf{Reach}(T)$
with probability 1.
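
As a purely numerical illustration of the last bound (the values of $r$, $n$ and the threshold below are arbitrary choices, and the function names are hypothetical), the following sketch evaluates $(1 - r^n)^z$ exactly and finds the number of blocks of $n$ steps after which the bound drops below a given threshold.

```python
from fractions import Fraction


def miss_bound(r, n, z):
    """Upper bound (1 - r^n)^z on the probability of missing T during z*n steps."""
    return (1 - r ** n) ** z


def blocks_needed(r, n, eps):
    """Smallest z such that the bound is at most eps (requires 0 < r <= 1, eps > 0)."""
    z = 0
    while miss_bound(r, n, z) > eps:
        z += 1
    return z
```

For instance, with $r = 1/2$ and $n = 2$ each block of two steps reaches $T$ with probability at least $1/4$, so the bound after three blocks is $(3/4)^3 = 27/64$.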

Argument for positive reachability. The proof for positive reachability follows
the same lines as for almost-sure reachability, with the following differences. The
game of perfect information $H$ is now interpreted as a reachability game with
objective $\mathsf{Reach}(\alpha)$. The mapping of plays is the same as above.
In the mapping of strategies from $G$ to $H$, we use the same ranking construction,
but we only claim that the initial state gets a rank. The argument is that if the
initial state got no rank, then player 2 would have a strategy to ensure
that all paths avoid the target states, in contradiction with the fact that player 1
has fixed a positive winning strategy. The rest of the proof is analogous to the
case of almost-sure reachability. □

It follows from the construction in the proof of Lemma 1 that pure strategies
with exponential memory are sufficient for positive (as well as almost-sure)
winning, and the exponential lower bound follows from the special case of
non-stochastic games [8]. Lemma 1 also gives an EXPTIME upper bound for the problem
since perfect-observation Büchi games can be solved in polynomial time [42]. The
EXPTIME-hardness follows from the sure winning problem for non-stochastic
games [37], where pure almost-sure (positive) winning strategies coincide with
sure winning strategies. The following theorem summarizes the results.

Theorem 1. Given one-sided partial-observation stochastic games with player 1 
partial and player 2 perfect, the following assertions hold for reachability objec- 
tives for player 1: 

1. (Memory complexity). Belief-based pure strategies are not sufficient both for 
positive and almost-sure winning; exponential memory is necessary and suf- 
ficient both for positive and almost-sure winning for pure strategies. 

2. (Algorithm). The problems of deciding the existence of a pure almost-sure 
and a pure positive winning strategy can be solved in time exponential in the 
state space of the game and polynomial in the size of the action sets. 

3. (Complexity). The problems of deciding the existence of a pure almost-sure
and a pure positive winning strategy are EXPTIME-complete.

Symbolic algorithms. The exponential Büchi (or reachability) game constructed
in the proof of Theorem 1 can be solved by computing classical fixpoint
formulas [24]. However, it is not necessary to construct the exponential game
structure explicitly. Instead, we can exploit the structure induced by the pre-order $\preceq$
defined by $(s,o) \preceq (s',o')$ if (i) $s \subseteq s'$, (ii) $o \subseteq o'$, and (iii) $o = \emptyset$ iff $o' = \emptyset$.
Intuitively, if a state $(s',o')$ is winning for player 1, then all states $(s,o) \preceq (s',o')$ are
also winning because they correspond to a better belief and a looser obligation.
Hence all sets computed by the fixpoint algorithm are downward-closed and thus
they can be represented symbolically by the antichain of their maximal elements
(see [16] for details related to antichain algorithms). This technique provides a
symbolic algorithm without explicitly constructing the exponential game.

Fig. 3. Remembering the belief of player 2 is necessary. A one-sided reachability
game where player 1 (round states) has perfect observation, player 2 (square states) is
blind. Player 1 has a pure almost-sure winning strategy that depends on the belief of
player 2 (in $q_2$), but no pure memoryless strategy is almost-sure winning.
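
A minimal sketch of this representation, assuming pairs $(s,o)$ are encoded as pairs of Python `frozenset`s: the pre-order test, the extraction of maximal elements, and membership in the downward closure. The function names are hypothetical, and the sketch only illustrates the data structure, not the fixpoint algorithm of [16].

```python
def leq(x, y):
    """(s, o) below (s2, o2): s and o are included in s2 and o2, and the two
    obligation sets are either both empty or both nonempty."""
    (s, o), (s2, o2) = x, y
    return s <= s2 and o <= o2 and (not o) == (not o2)


def maximal_elements(pairs):
    """Antichain of maximal elements of a finite set of pairs."""
    pairs = set(pairs)
    return {x for x in pairs if not any(x != y and leq(x, y) for y in pairs)}


def member(x, antichain):
    """Membership in the downward-closed set represented by the antichain."""
    return any(leq(x, m) for m in antichain)
```

Storing only the maximal elements is sound here because the sets being represented are downward-closed, as argued above.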

4 One-sided Games: Player 1 Perfect and Player 2 Partial 

Recall that we are interested in finding a pure winning strategy for player 1. 
Therefore, when we construct counter-strategies for player 2, we always assume 
that player 1 has already fixed a pure strategy. This is important for the way the 
belief of player 2 is updated. Although player 2 does not have perfect information
about the actions played by player 1, the belief of player 2 can be updated
according to the precise actions of player 1 because the response and the
counter-strategy of player 2 are designed after player 1 has fixed a strategy.

4.1 Lower bound on memory 

We present the following examples to illustrate two properties of the problem. 

Example 2. Remembering the belief of player 2 is necessary. We present
an example of a game where player 1 has perfect observation but needs to
remember the belief of player 2 to ensure positive or almost-sure reachability. The
game is shown in Fig. 3. The target is $T = \{$q☺$\}$. Player 2 is blind. If player 2
chooses a in the initial state $q_0$, then his belief will be $\{q_1, q_2\}$, and if he plays b,
then his belief will be $\{q_2, q_3\}$. In $q_2$, the choice of player 1 depends on the belief
of player 2. If the belief is $\{q_1, q_2\}$, then playing a in $q_2$ is not a good choice
because the belief of player 2 would be $\{q_1\}$ and player 2 could surely avoid q☺
by further playing b. For symmetrical reasons, if the belief of player 2 is $\{q_2, q_3\}$
in $q_2$, then playing b is not a good choice for player 1. Therefore, there is no
positively winning memoryless strategy for player 1. However, we show that there
exists an almost-sure winning belief-based strategy for player 1 as follows: in $q_2$,
play b if the belief of player 2 is $\{q_1, q_2\}$, and play a if the belief of player 2 is
$\{q_2, q_3\}$. Note that player 1 has perfect observation and thus can observe the
actions of player 2. This ensures the next belief of player 2 to be $\{q_3, q_4\}$ and
therefore no matter the next action of player 2, the state q☺ is reached with
probability 1/2. Repeating this strategy ensures reaching q☺ with probability 1. ■

Fig. 4. A one-sided reachability game $L_n$ with reachability objective in which player 1
has perfect observation and player 2 is blind. Player 1 needs exponential memory to
win positive reachability.

Example 3. Memory of non-elementary size may be necessary for positive
and almost-sure reachability. We show that player 1 may need memory
of non-elementary size to win positively (as well as almost-surely) in a
reachability game. We present a family of one-sided games $G_n$ where player 1 has perfect
observation, and player 2 has partial observation both about the state of the
game, and the actions played by player 1. We explain the example step by step.
The key idea of the example is that the winning strategy of player 1 in game $G_n$
will need to simulate a counter system (with $n$ integer-valued counters) where
the operations on counters are increment and division by 2 (rounding down),
and to reach strictly positive counter values.

Counters. First, we use a simple example to show that counters appear naturally 
in the analysis of the game under pure strategies. 

Consider the family of games $(L_n)_{n \in \mathbb{N}}$ shown in Fig. 4, where the reachability
objective is $\mathsf{Reach}(\{q_0\})$. In the first part, the states L and R are indistinguishable
for player 2. Consider the strategy of player 1 that plays b in L and R. Then,
the state $q_n$ is reached by two play prefixes $\rho_{up} = q_I L q_n$ and $\rho_{dw} = q_I R q_n$ that
player 2 cannot distinguish. Therefore, player 2 has to play the same action in
both play prefixes, while perfectly-informed player 1 can play different actions.
In particular, if player 1 plays a in $\rho_{up}$ and b in $\rho_{dw}$, then no matter the action
chosen by player 2 the state $q_{n-1}$ is reached with positive probability. However,
because only one play prefix reaches $q_{n-1}$, this strategy of player 1 cannot ensure
reaching $q_{n-2}$ with positive probability.

Player 1 can ensure reaching $q_{n-2}$ (and $q_0$) with positive probability with
the following exponential-memory strategy. For the first $n-1$ visits to either L
or R, play b, and on the $n$th visit, play a. This strategy produces $2^n$ different
play prefixes from $q_I$ to $q_n$, each with probability $\frac{1}{2^n}$. Considering the mapping
$L \mapsto a$, $R \mapsto b$, each such play prefix $\rho$ is mapped to a sequence $w_\rho$ of length $n$
over $\{a, b\}$ (for example, the play prefix $q_I L q_I R q_I L q_n$ is mapped to aba). The
strategy of player 1 is to play the sequence $w_\rho$ in the next $n$ steps after $\rho$. This
strategy ensures that for all $0 \le i \le n$, there are $2^i$ play prefixes which reach $q_i$
with positive probability, all being indistinguishable for player 2. The argument
is an induction on $i$. The claim is true for $i = n$, and if it holds for $i = k$, then no
matter the action chosen by player 2 in $q_k$, the state $q_{k-1}$ is reached with positive
probability by half of the $2^k$ play prefixes, i.e. $2^{k-1}$ play prefixes. This establishes
the claim. As a consequence, one play prefix reaches $q_0$ with positive probability.
This strategy requires exponential memory, and an inductive argument shows
that this memory is necessary because player 1 needs to have at least 2 play
prefixes that are indistinguishable for player 2 in state $q_1$, and at least $2^i$ play
prefixes in $q_i$ for all $0 \le i \le n$.

[Figure: the counter system for $n = 4$. From left to right, the self-loops are labeled
$(\cdot,\cdot,\cdot,+1)$, $(\cdot,\cdot,+1,\div 2)$, $(\cdot,+1,\div 2,\div 2)$ and $(+1,\div 2,\div 2,\div 2)$; one transition is
labeled $(\div 2,\div 2,\div 2,\div 2)$.]

Fig. 5. A family $(C_n)_{n \in \mathbb{N}}$ of counter systems with $n$ counters and $n + 1$ states where
the shortest execution to reach $(q_0, k_1, \dots, k_n)$ with positive counters (i.e., $k_i > 0$ for
all $1 \le i \le n$) from $(q_n, 0, \dots, 0)$ is of non-elementary length. The numbers above the
self-loops show the number of times each self-loop is taken along the shortest execution.
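
The halving argument can be checked on small instances. In the sketch below, each play prefix is identified with its word $w_\rho$, and we assume (this rule is not read off Fig. 4 and only serves the illustration) that a prefix advances from $q_k$ to $q_{k-1}$ exactly when the letter played by player 1 equals the action chosen by player 2, who must choose one action for all prefixes. The function name is hypothetical.

```python
from itertools import product


def surviving_prefixes(n, player2_actions):
    """Number of prefixes still progressing after each action of player 2.

    Assumption: a prefix survives a step iff its next letter matches player 2's action.
    """
    alive = list(product("ab", repeat=n))  # the 2^n words w_rho
    counts = [len(alive)]
    for step, b in enumerate(player2_actions):
        alive = [w for w in alive if w[step] == b]
        counts.append(len(alive))
    return counts
```

Whatever sequence player 2 chooses, the count is halved at every step, so exactly one prefix remains after $n$ steps.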

Non-elementary counters. Now, we present a family $C_n$ of counter systems where
the shortest execution is of non-elementary length (specifically, the shortest
length is greater than a tower $2^{2^{\cdot^{\cdot^{2}}}}$ of exponentials of height $n$). The counter
system $C_4$ (for $n = 4$) is shown in Fig. 5. The operations on counters can be
increment ($+1$), division by 2 ($\div 2$), and idle ($\cdot$). In general, $C_n$ has $n$ counters
$c_1, \dots, c_n$ and $n + 1$ states $q_0, \dots, q_n$. In state $q_i$ of $C_n$ ($0 < i \le n$), the counter
$c_i$ can be incremented and at the same time all the counters $c_j$ for $j > i$ are
divided by 2. From $q_n$, to reach $q_0$ with strictly positive counters (i.e., all counters
have value at least 1), we show that it is necessary to execute the self-loop on
state $q_n$ a non-elementary number of times. In Fig. 5, the numbers above the
self-loops show the number of times they need to be executed. When leaving $q_1$,
the counters need to have value at least 2 in order to survive the transition to $q_0$
which divides all counters by 2. Since the first counter can be incremented only
in state $q_1$, the self-loop in $q_1$ has to be executed 2 times. Hence, when leaving
$q_2$, the other counters need to have value at least $2 \cdot 2^2 = 2^3$ in order to survive
the self-loops in $q_1$. Therefore, the self-loop in $q_2$ is executed $2^3$ times. And so
on. In general, if the self-loop on state $q_i$ is executed $k$ times (in order to get
$c_i = k$), then the counters $c_{i+1}, \dots, c_n$ need to have value $k \cdot 2^k$ when entering
$q_i$ (in order to guarantee a value at least $k$ of these counters). In $q_n$, the last
counter $c_n$ needs to have value $f^n(1)$ where $f^n$ is the $n$th iterate of the function
$f : \mathbb{N} \to \mathbb{N} : x \mapsto x \cdot 2^x$. This value is greater than a tower of exponentials of
height $n$.

Fig. 6. Gadgets to simulate idle, increment, and division by 2.
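
The growth of $f^n(1)$ and the role of the self-loop counts can be checked with a short simulation. The sketch assumes, as the discussion above suggests but does not state explicitly, that moving from $q_i$ to $q_{i-1}$ for $i > 1$ leaves the counters unchanged and that only the final transition to $q_0$ halves all counters; the function names are hypothetical.

```python
def f(x):
    return x * 2 ** x


def iterate_f(n, x=1):
    """n-th iterate of f applied to x."""
    for _ in range(n):
        x = f(x)
    return x


def run_counter_system(loops):
    """loops[i-1] is the number of times the self-loop of q_i is taken.

    States are visited in the order q_n, ..., q_1, then q_0.
    Assumption: only the last transition (into q_0) halves all counters.
    """
    n = len(loops)
    c = [0] * n  # c[j-1] holds counter c_j
    for i in range(n, 0, -1):
        for _ in range(loops[i - 1]):
            c[i - 1] += 1            # increment c_i
            for j in range(i, n):    # halve c_{i+1}, ..., c_n
                c[j] //= 2
    return [v // 2 for v in c]       # transition to q_0
```

For $n = 3$ the loop counts $f^1(1) = 2$, $f^2(1) = 2^3$ and $f^3(1) = 2^{11}$ yield final counters $[1, 1, 1]$, while taking the self-loop of $q_3$ one time fewer leaves the last counter at 0 in this run.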

Gadgets for increment and division. In Fig. 6, we show the gadgets that are
used to simulate operations on counters. The gadgets are game graphs where the
player-1 actions a, b are indistinguishable for player 2 (but player 2 can observe
and distinguish the action #). The actions a, b are used by player 1 to simulate
the operations on the counters. The # is used to simulate the transitions from
state $q_i$ to $q_{i-1}$ of the counter system of Fig. 5. All states of the gadgets have
the same observation for player 2. Recall that player 1 has perfect observation.

The idle gadget is straightforward. The actions a, b have no effect. In the
other gadgets, the value of the counters is represented by the number of paths
that are indistinguishable for player 2, and that end up in the entry state of the
gadget (for the value of the counter before the operation) or in the exit state
(for the value of the counter after the operation).

Consider the division gadget $div_2$. If player 2 plays an action that matches the
choice of player 1, then the game leaves the gadget and the transition will go to
the initial state of the game we construct (which is shown in Fig. 8). Otherwise,
the action of player 2 does not match the action of player 1 and the play reaches
the exit state of the gadget. Let $k$ be the number of indistinguishable³ paths in
the entry state of the gadget. By playing a after $k_1$ such paths and b after $k_2$
paths (where $k_1 + k_2 = k$), player 1 ensures that $\min\{k_1, k_2\}$ indistinguishable
paths reach the exit state of the gadget (because in the worst case, player 2
can choose his action to match the action of player 1 over $\max\{k_1, k_2\}$ paths).
Hence, player 1 can ensure that $\lfloor k/2 \rfloor$ indistinguishable paths get to the exit state.
In the game of Fig. 8, the entry and exit state of division gadgets are merged.
The argument still holds.
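
The worst-case count for the division gadget is a small min/max computation, sketched below with hypothetical function names: player 1 splits the $k$ paths into $k_1$ paths after which a is played and $k_2$ after which b is played, and player 2 answers with the single action that removes the larger group.

```python
def paths_reaching_exit(k1, k2):
    """Paths reaching the exit state in the worst case for player 1.

    A path leaves the gadget when player 2's action matches player 1's action,
    so playing a removes the k1 paths and playing b removes the k2 paths.
    """
    exit_if_p2_plays_a = k2
    exit_if_p2_plays_b = k1
    return min(exit_if_p2_plays_a, exit_if_p2_plays_b)


def best_split(k):
    """Best guarantee over all ways to split k indistinguishable paths."""
    return max(paths_reaching_exit(k1, k - k1) for k1 in range(k + 1))
```

The best guarantee is obtained with a balanced split and equals the integer division of $k$ by 2.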

Consider the increment gadget inc in Fig. 6. We use this gadget with the
assumption that the entry state is not reached by more than one indistinguishable
path. This will be the case in the game of Fig. 8. Player 1 can achieve $k$
indistinguishable paths in the exit state as follows. In state $q_{ab}$, play action a if
the last visited state is $q_L$, and play action b if the last visited state is $q_R$. No
matter the choice of player 2, one path will reach the exit state, and the other
path will get to the entry state. Repeating this scenario $k$ times gives $k$ paths in
the exit state. We show that there is essentially no faster way to obtain $k$ paths
in the exit state. Indeed, if player 1 chooses the same action (say a) after the two
paths ending up in $q_{ab}$, then against the action b from player 2, two paths reach
the exit state, and no path gets to the entry state. Then, player 1 can no longer
increment the number of paths. Therefore, to get $k$ paths in the exit state, the
fastest way is to increment one by one up to $k - 2$, and then get 2 more paths as
a last step. Note that it is not in the interest of player 2 to match the action of
player 1 if player 1 plays the same action, because this would double the number
of paths.

Structure of the game. The game $G_n$ which requires memory of non-elementary
size is sketched in Fig. 8 for $n = 3$. Its abstract structure is shown in Fig. 7,
corresponding to the structure of the counter system in Fig. 5. The alphabet of
player 1 is $\{a, b, \#\}$. For the sake of clarity, some transitions are not depicted
in Fig. 8. It is assumed that for player 1, playing an action from a state where
this action has no transition depicted leads to the initial state of the game. For
example, playing # in state $q_4$ goes to the initial state, and from the target state
q☺, all transitions go to the initial state.

³ In the rest of this section, the word indistinguishable means indistinguishable for
player 2.

Fig. 7. Abstract view of the game in Fig. 8 as a 3-counter system.

Fig. 8 shows the initial state $q_I$ of the game from which a uniform
probabilistic transition branches to the three states $q_7, r_7, s_7$. The idea of this game is
that player 1 needs to ensure that the states $q_1, r_1, s_1$ are reached with positive
probability, so as to ensure that no matter the action (a, b, or c) chosen by
player 2, the state q☺ is reached with positive probability. From $q_1, r_1, s_1$, the
other actions of player 2 (i.e., b and c from $q_1$, a and c from $r_1$, etc.) lead to
the initial state. Player 2 can observe the initial state. All the other states are
indistinguishable.

Intuitively, each "line" of states (q's, r's, and s's) simulates one counter.
Synchronization of the operations on the three counters is ensured by the special
(and visible to player 2) symbol #. Intuitively, since # is visible to player 2,
player 1 must play # at the same "time" in the three lines of states (i.e., after
the same number of steps in each line). Otherwise, player 2 may eliminate
one line of states from his belief. For example, if player 1 plays # in the first
step in lines q and r, but not in line s, then player 2 observing # can safely
update his belief to $\{q_\cdot, r_\cdot\}$, and thus avoid playing c when one of the states $q_1$,
$r_1$ is reached. In Fig. 8, the dotted lines and the subscripts on # emphasize the
layered structure of the game, corresponding to the structure of Fig. 7.

From all the above, it follows that player 1 needs memory of non-elementary
size in order to ensure indistinguishable paths ending up in each of the
states $q_1, r_1, s_1$, and win with positive probability. Since all other paths go
back to the initial state, this strategy can be repeated over and over again to
achieve almost-sure reachability as well. ■

Theorem 2. In one-sided partial-observation stochastic games with player 1
perfect and player 2 partial, both pure almost-sure and pure positive winning
strategies for reachability objectives for player 1 require memory of
non-elementary size in general.

4.2 Upper bound for positive reachability with almost-sure safety 

We present the solution of one-sided games with a conjunction of positive reacha- 
bility and almost-sure safety objectives, in which player 1 has perfect observation 
and player 2 has partial observation. This will be useful in Section 4.3 to solve 
almost-sure reachability, and using a trivial safety objective (safety for the whole 
state space) it also gives the solution for positive reachability. 

Let $G = (Q, q_0, \delta_G)$ be a game over alphabets $A_1, A_2$ and observation set $\mathcal{O}_2$
for player 2, with reachability objective $\mathsf{Reach}(T)$ (where $T \subseteq Q$) and safety
objective $\mathsf{Safe}(Q_G)$ (where $Q_G \subseteq Q$ represents a set of good states) for player 1.
We assume that the states in $T$ are absorbing and that $T \subseteq Q_G$. This assumption
is satisfied by the games we consider in Section 4.3, as well as by the case of
a trivial safety objective ($Q_G = Q$). The goal of player 1 is to ensure positive
probability to reach $T$ and almost-sure safety for the set $Q_G$.

Before presenting the algorithm for solving these games in pure strategies,
we consider the case of randomized strategies. Afterwards, we use the results for
randomized strategies to solve the case of pure strategies.

Step 1 - Winning with randomized strategies. First, we show that with
randomized strategies, memoryless strategies are sufficient. It suffices to play
uniformly at random the set of safe actions. In a state $q$, an action $a \in A_1$ is
safe if $\mathsf{Post}_G(q,a,b) \subseteq \mathsf{Win}_{safe}$ for all $b \in A_2$, where $\mathsf{Win}_{safe}$ is the set of states
that are sure winning⁴ for player 1 in $G$ for the safety objective $\mathsf{Safe}(Q_G)$. This
strategy ensures that the set $Q \setminus Q_G$ of bad states is never reached, and from
the positive winning region of player 1 for $\mathsf{Reach}(T)$ it ensures that the set $T$
is reached with positive probability. Therefore, computing the set $Z$ of states
that are winning for player 1 in randomized strategies can be done by fixing
the uniformly randomized safe strategy for player 1, and checking that player 2
does not almost-surely win the safety objective $\mathsf{Safe}(Q \setminus T)$, which requires the
analysis of a POMDP for almost-sure safety and can be done in exponential time
using a simple subset construction [15, Theorem 2].
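
The set of safe actions can be computed by a standard greatest-fixpoint iteration. The sketch below is an illustration under assumptions not fixed by the text: the game is given by a dictionary `post[(q, a, b)]` of successor sets, and the function names are hypothetical. It computes the states from which player 1 can stay in the good states forever, and then the safe actions of a state.

```python
def win_safe(actions1, actions2, post, good):
    """Greatest set W of good states such that every state of W has an action
    whose successors stay in W against every player-2 action."""
    win = set(good)
    while True:
        keep = {q for q in win
                if any(all(post[(q, a, b)] <= win for b in actions2)
                       for a in actions1)}
        if keep == win:
            return win
        win = keep


def safe_actions(q, actions1, actions2, post, win):
    """Actions of player 1 in q that cannot leave the set `win`."""
    return {a for a in actions1
            if all(post[(q, a, b)] <= win for b in actions2)}
```

The memoryless randomized strategy described above then plays, in each state of the computed set, the uniform distribution over the result of `safe_actions`.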

Note that $T \subseteq Z$ and that from all states in $Z$, player 1 can ensure that $T$ is
reached with positive probability within at most $2^{|Q|}$ steps, while from any state
$q \notin Z$, player 1 cannot win positively with a randomized strategy, and therefore
also not with a pure strategy.

⁴ Note that for safety objectives, the notions of sure winning and almost-sure winning
coincide, and pure strategies are sufficient.

Fig. 8. Memory of non-elementary size may be necessary for positive and
almost-sure reachability. A family of one-sided reachability games in which player 1
has perfect observation. Player 1 needs memory of non-elementary size to win positive
reachability (as well as almost-sure reachability).

Step 2 - Pure strategies to simulate randomized strategies. Second, we
show that pure strategies can in some cases simulate the behavior of randomized
strategies. As we have seen in the gadget inc of Fig. 6, if there are two play
prefixes ending up in the same state and that are indistinguishable for player 2
(e.g., $q_0 L q_{ab}$ and $q_0 R q_{ab}$ in the example), then player 1 can simulate a random
choice of action over support $\{a, b\}$ by playing a after $q_0 L q_{ab}$, and playing b after
$q_0 R q_{ab}$. No matter the choice of player 2, one of the plays will reach $q_0$ and the
other will reach the exit state of the gadget. Intuitively, this corresponds to a
uniform probabilistic choice of the actions a and b: the state $q_0$ and the exit state
are reached with probability $\frac{1}{2}$.

In general, if there are $|A_1|$ indistinguishable play prefixes ending up in the
same state $q$, then player 1 can simulate a random choice of actions over $A_1$ from
$q$. However, the number of indistinguishable play prefixes in a successor state $q'$
may have decreased by a factor $|A_1|$ (there may be just one play reaching $q'$).
Hence, in order to simulate a randomized strategy during $k$ steps, player 1 needs
to have $|A_1|^k$ indistinguishable play prefixes. Since at most $2^{|Q|}$ steps are
sufficient for a randomized strategy to achieve the reachability objective, an
upper bound on the number of play prefixes that are needed to simulate a
randomized strategy using a pure strategy is $\mathrm{Num} = |A_1|^{2^{|Q|}}$. More precisely,
if the belief of player 2 is $B \subseteq Z$ and in each state $q \in B$ there are at least
$\mathrm{Num}$ indistinguishable play prefixes, then player 1 wins with a pure strategy
that essentially simulates a winning randomized strategy (which exists since
$q \in Z$) for $2^{|Q|}$ steps.
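The counting in Step 2 can be illustrated with a minimal sketch. The function names below are hypothetical and not part of the paper; the step bound $2^{|Q|}$ is taken as an assumption from the discussion above. The sketch only shows the arithmetic: each simulated round may divide the number of indistinguishable prefixes by $|A_1|$, and spreading the actions over the copies of a state mimics a random choice with full support.

```python
def prefixes_needed(num_actions: int, steps: int) -> int:
    """Indistinguishable prefixes that suffice to simulate `steps` rounds.

    One simulated round can shrink the prefix count by a factor |A_1|,
    so |A_1| ** steps prefixes are enough for `steps` rounds.
    """
    return num_actions ** steps


def assign_actions(prefix_count: int, actions: list) -> list:
    """Give copy i of the current state the action actions[i % |A_1|].

    With prefix_count >= |A_1| every action is played after some prefix,
    which mimics a random choice with full support.
    """
    return [actions[i % len(actions)] for i in range(prefix_count)]


def num_bound(num_actions: int, num_states: int) -> int:
    """Num = |A_1| ** (2 ** |Q|), assuming the step bound is 2 ** |Q|."""
    return prefixes_needed(num_actions, 2 ** num_states)
```

For instance, with two actions and two indistinguishable prefixes, `assign_actions(2, ["a", "b"])` plays `a` after one prefix and `b` after the other, as in the gadget example.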

Step 3 - Counting abstraction for pure strategies. We present a construc- 
tion of a game of perfect observation H such that player 1 wins in H if and only 
if player 1 wins in G. The objective in H is a conjunction of positive reachability 
and almost-sure safety objectives, for which pure memoryless winning strategies 
exist: for every state we restrict the set of actions to safe actions, and then we 
solve positive reachability on a perfect-observation game. The result follows since 
for perfect-observation games pure memoryless positive winning strategies exist 
for reachability objectives [17]. 

State space. The idea of this construction is to keep track of the belief set
$B \subseteq Q$ of player 2, and for each state $q \in B$, of the number of
indistinguishable play prefixes that end up in $q$. For $k \in \mathbb{N}$, we denote by $[k]$
the set $\{0, 1, \dots, k\}$. A state of $H$ is a counting function
$f : Q \to [K_*] \cup \{\omega\}$ where $K_* \in \mathbb{N}$ is of order

$$|A_1|^{|A_1|^{\cdot^{\cdot^{\cdot^{|A_1|^{2^{O(n)}}}}}}}$$

where the number of nested exponentials is in $O(n)$ (where $n = |Q|$).

As we have seen in the example of Fig. 8, it may be necessary to keep track
of a non-elementary number of play prefixes. We show that the bound $K_*$ is
sufficient, and that we can substitute larger numbers by the special symbol $\omega$
to obtain a finite counting abstraction. The belief associated with a counting
function $f$ is the set $\mathrm{Supp}(f) = \{q \in Q \mid f(q) \neq 0\}$, and the states $q$ such that
$f(q) = \omega$ are called $\omega$-states.

Action alphabet. In $H$, an action of player 1 is a function $\hat{a} : Q \times [K_*] \to A_1$ that
assigns to each copy of a state in the current belief (of player 2), the action played
by player 1 after the corresponding play prefix in $G$. We denote by
$\mathrm{Supp}(\hat{a}(q, \cdot)) = \{\hat{a}(q, i) \mid i \in [K_*]\}$ the set of actions played by $\hat{a}$ in $q \in Q$.

The action set of player 2 in the game $H$ is the same as in $G$.

Transitions. Let $\mathbb{1}(a, A)$ be 1 if $a \in A$, and 0 if $a \notin A$. We denote this function
by $\mathbb{1}(a \in A)$. Given $f$ and $\hat{a}$ as above, given an action $b \in A_2$ and an observation
$\gamma \in O_2$, let $f' = \mathrm{Succ}(f, \hat{a}, b, \gamma)$ be the function such that $f'(q') = 0$ for all
$q' \notin \gamma$, and such that for all $q' \in \gamma$:

$$f'(q') = \begin{cases} \omega & \text{if } \exists q \in Q \cdot \exists a \in \mathrm{Supp}(\hat{a}(q, \cdot)) : f(q) = \omega \land q' \in \mathrm{Post}_G(q, a, b) \\ x & \text{otherwise} \end{cases}$$

where $x = \sum_{q \in \mathrm{Supp}(f)} \sum_{i=0}^{f(q)-1} \mathbb{1}(q' \in \mathrm{Post}_G(q, \hat{a}(q, i), b))$.

Note that if the current state $q$ is an $\omega$-state, then only the support
$\mathrm{Supp}(\hat{a}(q, \cdot))$ of the function $\hat{a}$ matters.
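The successor operator on counting functions can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: a counting function is stored as a dictionary over its support, the symbol ω is represented by `math.inf`, the action function of player 1 is a dictionary mapping each state to a list of actions (one per copy, or just the support for an ω-state), and `post` is a hypothetical callable returning the set of possible successors of a state under a pair of actions.

```python
import math

OMEGA = math.inf  # stand-in for the symbol omega (assumption of this sketch)


def succ(f, a_hat, b, gamma, post):
    """Counting function reached from f under a_hat, b and observation gamma.

    f      : dict state -> positive int or OMEGA (states with value 0 omitted)
    a_hat  : dict state -> list of player-1 actions, one per copy of the state
    b      : action of player 2
    gamma  : set of states forming the observation of player 2
    post   : hypothetical callable (q, a, b) -> set of successor states
    """
    f_next = {}
    for q_next in gamma:
        # omega propagates along any action in the support of an omega-state
        from_omega = any(
            count == OMEGA and q_next in post(q, a, b)
            for q, count in f.items()
            for a in set(a_hat.get(q, []))
        )
        if from_omega:
            f_next[q_next] = OMEGA
            continue
        # otherwise count the copies (q, i) whose action can lead to q_next
        x = sum(
            1
            for q, count in f.items() if count != OMEGA
            for i in range(count)
            if q_next in post(q, a_hat[q][i], b)
        )
        if x > 0:
            f_next[q_next] = x
    return f_next
```

States outside the observation are left out of the result, which corresponds to value 0. The result may exceed the bound on counters, so in the construction it would still have to go through the abstraction step.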

Now $f' = \mathrm{Succ}(f, \hat{a}, b, \gamma)$ may not be a counting function because it may
assign values greater than $K_*$ to some states. We show that beyond certain
bounds, it is not necessary to remember the exact value of the counters and we
can replace such large values by $\omega$. Intuitively, the value $\omega$ can be interpreted
as "very large and definitely positive value". This abstraction needs to be done
carefully in order to obtain the desired upper bound (namely, $K_*$). When a
counter $f(q)$ has value $\omega$, the successors of $q$ have value $\omega$ according to $\mathrm{Succ}(\cdot)$,
which is faithful if the exact value of the counter $f(q)$ is large enough. In fact,
large enough means that the counter has value at least $|A_1|$, as this allows player 1
to play each action at least once. Hence the abstraction remains faithful during
$K$ steps if the counters with value greater than $|A_1|^K$ are set to $\omega$. We know
that if all counters have value greater than $K_1 = |A_1|^{2^{|Q|}}$, then player 1 wins by
simulating a randomized strategy. Therefore, when all counters but one
already have value $\omega$, we set the last counter to $\omega$ if it has value greater than $K_1$.
Since this can take at most $K_1$ steps, the other counters with value $\omega$ need to
have value at least $K_2 = K_1 \cdot |A_1|^{K_1}$.

Therefore, when all counters but two already have value $\omega$, whenever a
counter gets value greater than $K_2$ we set it to $\omega$. This can take at most
$(K_2)^2$ steps and the other counters with value $\omega$ need to have value at least
$K_3 = K_2 \cdot |A_1|^{(K_2)^2}$. In general, when all counters but $k$ have value $\omega$, we set a
counter to $\omega$ if it has value at least $K_{k+1} = K_k \cdot |A_1|^{(K_k)^k}$. It can be shown by
induction that $K_k$ is of order

$$|A_1|^{|A_1|^{\cdot^{\cdot^{\cdot^{|A_1|^{2^{O(n)}}}}}}}$$

where the tower of exponentials is of height $k$, and thus we do not need to
store counter values greater than $K_*$.
We define the abstraction mapping $f' = \mathrm{Abs}(f)$ for $f : Q \to \mathbb{N} \cup \{\omega\}$ as follows:

    Let $k = |\{q \mid f(q) = \omega\}|$ be the number of counters with value $\omega$
    in $f$. If there is a state $q$ with finite value $f(q)$ greater than $K_{n-k}$,
    then $f'(q) = \omega$ and $f'$ agrees with $f$ on all states except $q$ (i.e.,
    $f'(q') = f(q')$ for all $q' \neq q$). Otherwise, $f' = f$.

Actually, we define $\mathrm{Abs}(f)$ as the $n$th iterate of the above procedure. Given
$f$, $\hat{a}$, and $b$, let $\delta_H(f, \hat{a}, b)$ be the uniform distribution over the set of
counting functions $f'$ such that there exists an observation $\gamma \in O_2$ such that
$f' = \mathrm{Abs}(\mathrm{Succ}(f, \hat{a}, b, \gamma))$ and $\mathrm{Supp}(f') \neq \emptyset$.

Note that the operators $\mathrm{Succ}(\cdot)$ and $\mathrm{Abs}(\cdot)$ are monotone, that is $f \le f'$
implies $\mathrm{Abs}(f) \le \mathrm{Abs}(f')$ as well as $\mathrm{Succ}(f, \hat{a}, b, \gamma) \le \mathrm{Succ}(f', \hat{a}, b, \gamma)$ for all
$\hat{a}, b, \gamma$ (where $\le$ is the componentwise order).
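The thresholds and the abstraction step can be sketched as below, under assumptions that are not in the paper: counting functions are dictionaries over their support, ω is represented by `math.inf`, and the first threshold is passed in as a parameter `k1` rather than computed. The recurrence grows as a tower of exponentials, so the sketch is only usable for very small parameters.

```python
import math

OMEGA = math.inf  # stand-in for the symbol omega (assumption of this sketch)


def thresholds(num_actions: int, k1: int, levels: int) -> list:
    """Return [None, K_1, ..., K_levels] with K_{k+1} = K_k * |A_1| ** (K_k ** k)."""
    ks = [None, k1]
    for k in range(1, levels):
        prev = ks[k]
        ks.append(prev * num_actions ** (prev ** k))
    return ks


def abstract(f: dict, ks: list, n: int) -> dict:
    """n-fold iterate of the one-step abstraction; n is the number of states.

    In each round, with k counters already at omega, one finite counter
    whose value exceeds K_{n-k} (if any) is replaced by omega.
    """
    f = dict(f)
    for _ in range(n):
        k = sum(1 for v in f.values() if v == OMEGA)
        if k >= n:
            break  # every counter is already omega
        bound = ks[n - k]
        for q, v in f.items():
            if v != OMEGA and v > bound:
                f[q] = OMEGA
                break  # the one-step procedure changes a single state
    return f
```

With two actions and `k1 = 4`, the first two thresholds are 4 and 64; the third already has more than a thousand decimal digits, which gives a feeling for the non-elementary growth.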

Objective. Given $T \subseteq Q$ and $Q_G \subseteq Q$ defining the reachability and safety
objectives in $G$, the objective in the game $H$ is a conjunction of positive
reachability and almost-sure safety objectives, defined by $\mathrm{Reach}(T_H)$ where⁵
$T_H = \{f \mid \mathrm{Supp}(f) \subseteq Z \land \forall q \in \mathrm{Supp}(f) : f(q) = \omega\} \cup \{f \mid \mathrm{Supp}(f) \cap T \neq \emptyset\}$ and
by $\mathrm{Safe}(\mathrm{Good}_H)$ where $\mathrm{Good}_H = \{f \mid \mathrm{Supp}(f) \subseteq Q_G\}$.

Step 4 - Correctness argument. First, assume that there exists a pure winning
strategy $\sigma$ for player 1 in $G$, and we show how to construct a winning
strategy $\sigma_H$ in $H$. As we play the game in $G$ using $\sigma$, we keep track of the exact
number of indistinguishable play prefixes ending up in each state. This allows
us to define the action $\hat{a}$ to play in $H$ by collecting the actions played by $\sigma$ in
all the indistinguishable play prefixes. Note that by monotonicity, the counting
abstractions in the corresponding play prefix of $H$ are at least as big (assuming
$\omega > k$ for all $k \in \mathbb{N}$), and thus the action $\hat{a}$ is well-defined. Since $\sigma$ is winning, $T$
is reached with positive probability in $G$, and the set $Q \setminus Q_G$ is never hit, and
therefore a counting function $f \in T_H$ (such that $\mathrm{Supp}(f) \cap T \neq \emptyset$) is reached
with positive probability in $H$, and all plays remain safe in the set $\mathrm{Good}_H$.

Second, assume that there exists a winning strategy $\sigma_H$ for player 1 in $H$,
and we show how to construct a pure winning strategy $\sigma$ in $G$. We can assume
that $\sigma_H$ is pure memoryless. Fix an arbitrary strategy $\pi$ for player 2 and consider
the unfolding tree of the game $H$ when $\sigma_H$ and $\pi$ are fixed (we get a tree and
not just a path because the game is stochastic). In this tree, there is a shortest
path to reach $T_H$ and this path has no loop since strategy $\sigma_H$ is memoryless.
We show that the length of this path can be bounded, and that the bounds used
in the counting abstraction with $\omega$'s are faithful, showing that the strategy $\sigma_H$
can be simulated in $G$ (in particular, we need to show that there are sufficiently
many indistinguishable play prefixes in $G$ to simulate the action 'functions' $\hat{a}$
played by $\sigma_H$). More precisely, the bounds $K_1, K_2, \dots$ have been chosen in such
a way that counters with value $\omega$ keep a positive value until all counters get
value $\omega$. For example, when all counters but $k$ have value $\omega$, it takes at most
$(K_k)^k$ steps to get one more counter with value $\omega$ by the argument given in
Step 3. Therefore, along the shortest path to $T_H$, either we reach a counting
function $f$ with $f(q) = \omega$ for all $q \in \mathrm{Supp}(f)$, or a counting function $f$ with
$\mathrm{Supp}(f) \cap T \neq \emptyset$. In the first case, we can simulate $\sigma_H$ in $G$ to this point, and
then win by simulating a winning randomized strategy, and in the second case
the reachability objective $\mathrm{Reach}(T)$ is achieved in $G$ with positive probability.
Since the strategy $\sigma$ ensures that the support of the counting functions never
hits the set $Q \setminus Q_G$, player 1 wins in $G$ for the positive reachability and almost-sure
safety objectives.

⁵ Recall that $Z$ is the set of states that are winning in $G$ for player 1 in randomized
strategies.

Fig. 9. Almost-sure winning strategy may require more memory than positive
winning strategies. A one-sided reachability game where player 1 (round states)
has perfect observation, player 2 (square states) is blind. Player 1 has a pure almost-sure
winning strategy, but no pure belief-based memoryless strategy is almost-sure winning.
However, player 1 has a pure belief-based memoryless strategy that is positive winning.

Theorem 3. In one-sided partial-observation stochastic games with player 1
perfect and player 2 partial, non-elementary size memory is sufficient for pure
strategies to ensure positive probability reachability along with almost-sure safety
for player 1; and hence for pure positive winning strategies for reachability
objectives for player 1 the non-elementary memory bound is optimal.

4.3 Upper bound for almost-sure reachability 

In this section we present the algorithm to solve the almost-sure reachability 
problem. We start with an example to illustrate that in general strategies for 
almost-sure winning may be more complicated than positive winning for reach- 
ability objectives. 

Example 4 - Almost-sure winning strategy may require more memory
than positive winning strategies. The example of Fig. 9 illustrates a key
insight in the algorithmic solution of almost-sure reachability games where player 1
has perfect observation and player 2 has partial observation (he is blind in this
case). For player 1, playing $a$ in $q_1$ and in $q_2$ is a positive winning strategy to
reach $q_☺$. This is because from $\{q_1, q_2\}$, the belief of player 2 becomes $\{q_3, q_4\}$
and no matter the action chosen by player 2, the state $q_☺$ is reached with positive
probability from either $q_3$ or $q_4$.

However, always playing $a$ when the belief of player 2 is $\{q_1, q_2\}$ is not
almost-sure winning because if player 2 always chooses the same action (say $a$) in
$\{q_3, q_4\}$, then with probability $\frac{1}{2}$ the state $q_☺$ is not reached. Intuitively, this
happens because player 2 can guess that the initial state is, say $q_1$, and be right
with positive probability (here $\frac{1}{2}$). To be almost-surely winning, player 1 needs to
alternate actions $a$ and $b$ when the belief is $\{q_1, q_2\}$. The action $b$ corresponds to
the restart phase of the strategy, i.e., even assuming that player 2's belief would
be, say $\{q_1\}$, the action $b$ ensures that $q_☺$ is reached with positive probability
by making the belief $\{q_1, q_2\}$ again.

Notation. We consider $T$ as the set of target states and without loss of
generality assume that all target states are absorbing. In this section the belief
of player 2 represents the set of states in which the game can be with positive
probability. Given strategies $\sigma$ and $\pi$ for player 1 and player 2, respectively, a
state $q$ and a set $K \subseteq Q$ we denote by $\Pr_{q,K}^{\sigma,\pi}(\cdot)$ the probability measure over sets
of paths when the players play the strategies, the initial state is $q$ and the initial
belief for player 2 is $K$.

In the rest of this section we omit the subscript $G$ (e.g., we write $\Pi^O$ instead
of $\Pi_G^O$) as the game is clear from the context.

Bad states. Let $\overline{T} = Q \setminus T$. Let

$$Q_B = \{ q \in Q \mid \forall \sigma \in \Sigma^P \cdot \exists \pi \in \Pi^O : \Pr_{q,\{q\}}^{\sigma,\pi}(\mathrm{Safe}(\overline{T})) > 0 \}$$

be the set of states $q$ such that given the initial belief of player 2 is the singleton
$\{q\}$, for all pure strategies for player 1 there is a counter observation-based
strategy for player 2 to ensure that $\mathrm{Safe}(\overline{T})$ is satisfied with positive probability.
We will consider $Q_B$ as the set of bad states.

Property of an almost-sure winning strategy. Consider a pure almost-sure
winning strategy for player 1 that ensures against all observation-based strategies
of player 2 that $T$ is reached with probability 1. Then we claim that the belief of
player 2 must never intersect $Q_B$: otherwise, if the belief intersects $Q_B$,
let $q$ be a state in $Q_B$ that is reached with positive probability. Then player 2
simply assumes that the current state is $q$, updates the belief to $\{q\}$, and the
guess is correct with positive probability. Given the belief is $\{q\}$, since $q \in Q_B$,
it follows that against all player 1 pure strategies there is an observation-based
strategy for player 2 to ensure with positive probability that $T$ is not reached.
This contradicts that the strategy for player 1 is almost-sure winning.

Transformation. We transform the game by making all states in $Q_B$ absorbing.
Let $Q_G = Q \setminus Q_B$. By definition we have

$$Q_G = \{ q \in Q \mid \exists \sigma \in \Sigma^P \cdot \forall \pi \in \Pi^O : \Pr_{q,\{q\}}^{\sigma,\pi}(\mathrm{Reach}(T)) = 1 \}.$$

By the argument above that for a pure almost-sure winning strategy the belief
must never intersect $Q_B$, we have

$$Q_G = \{ q \in Q \mid \exists \sigma \in \Sigma^P \cdot \forall \pi \in \Pi^O : \Pr_{q,\{q\}}^{\sigma,\pi}(\mathrm{Reach}(T)) = 1 \text{ and } \Pr_{q,\{q\}}^{\sigma,\pi}(\mathrm{Safe}(Q \setminus Q_B)) = 1 \}.$$

Let

$$Q_G^p = \{ q \in Q \mid \exists \sigma \in \Sigma^P \cdot \forall \pi \in \Pi^O : \Pr_{q,\{q\}}^{\sigma,\pi}(\mathrm{Reach}(T)) > 0 \text{ and } \Pr_{q,\{q\}}^{\sigma,\pi}(\mathrm{Safe}(Q \setminus Q_B)) = 1 \}.$$

We now show that $Q_G^p = Q_G$. The inclusion $Q_G \subseteq Q_G^p$ is trivial, and we now
show the other inclusion $Q_G^p \subseteq Q_G$. Observe that in $Q_G^p$ we have the property of
positive reachability and almost-sure safety and we will use strategies for positive
reachability and almost-sure safety to construct an almost-sure winning strategy.
We consider $Q_B$ as the set of unsafe states (i.e., $Q_G$ is the safe set), and $T$ as
the target and invoke the results of Section 4.2: for all $q \in Q_G^p$ there is a pure
finite-memory strategy $\sigma_q$ of memory at most $B$ (where $B$ is non-elementary)
to ensure that from $q$, within $N = 2^{O(B)}$ steps, $T$ is reached with probability at
least some positive constant $\eta_q > 0$, even when the initial belief for player 2 is
$\{q\}$. Let $\eta = \min_{q \in Q_G^p} \eta_q$. A pure finite-memory almost-sure winning strategy is
described below. The strategy plays in two phases: (1) the Restart phase; and
(2) the Play phase. We define them as follows:

1. Restart phase. Let the current state be $q$, assume that the belief for player 2
   is $\{q\}$ and go to the Play phase with strategy $\sigma_q$ that ensures that $Q_G$ is
   never left and $T$ is reached within $N$ steps with probability at least $\eta > 0$.

2. Play phase. Let $\sigma$ be the strategy defined in the Restart phase, then play $\sigma$
   for $N$ steps and go back to the Restart phase.

The strategy is almost-sure winning as for all states in $Q_G^p$ and for all histories,
in every $N$ steps the probability to reach $T$ is at least $\eta > 0$, and $Q_G$ (and hence
$Q_G^p$) is never left. Thus the probability to reach $T$ in $N \cdot \ell$ steps, for $\ell \in \mathbb{N}$, is at
least $1 - (1 - \eta)^{\ell}$ and this tends to 1 as $\ell \to \infty$. Thus the desired result follows and we
obtain the almost-sure winning strategy.
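The two-phase strategy can be sketched as a small controller. This is an illustrative sketch only: the class name, the dictionary `sub_strategies` (one hypothetical strategy per state, each a callable on the history since the last restart), and the helper `reach_lower_bound` are assumptions of this example and do not come from the paper.

```python
class RestartPlayStrategy:
    """Alternate a Restart phase and a Play phase of `horizon` steps."""

    def __init__(self, sub_strategies: dict, horizon: int):
        self.sub_strategies = sub_strategies  # state q -> strategy sigma_q
        self.horizon = horizon                # the bound N on the Play phase
        self.current = None                   # strategy chosen at last restart
        self.local_history = []               # states seen since last restart

    def act(self, state):
        # Restart phase: pick sigma_q for the current state q, forget the past
        if self.current is None or len(self.local_history) == self.horizon:
            self.current = self.sub_strategies[state]
            self.local_history = []
        # Play phase: follow the chosen strategy on the local history
        self.local_history.append(state)
        return self.current(tuple(self.local_history))


def reach_lower_bound(eta: float, rounds: int) -> float:
    """Lower bound 1 - (1 - eta) ** rounds on reaching T within rounds * N steps."""
    return 1.0 - (1.0 - eta) ** rounds
```

The controller stores only the index of the current sub-strategy, that sub-strategy's own memory, and a step counter, which matches the shape of the memory bound discussed next.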

Memory bound and algorithm. The memory upper bound for the almost-sure
winning strategy constructed is $|Q| \cdot B + \log N$: we require $|Q|$
strategies of Section 4.2 of memory size $B$ and a counter to count up to
$N = 2^{O(B)}$ steps. We now present an algorithm for almost-sure reachability that works
in time $2^{|Q|} \times O(\mathrm{PosReachSureSafe})$, where PosReachSureSafe denotes
the complexity of solving the positive reachability with almost-sure safety
problem. The algorithm enumerates all subsets $Q' \subseteq Q$ and verifies that for all
$q \in Q'$ player 1 can ensure to reach $T$ with positive probability while staying safe in $Q'$
with probability 1. In other words, the algorithm enumerates all subsets $Q' \subseteq Q$ to
obtain the set $Q_G$. The enumeration is exponential and the verification requires
solving the positive reachability with almost-sure safety problem.
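The enumeration can be sketched as follows, treating the positive-reachability-with-almost-sure-safety check as a black box. Everything named here is hypothetical: `pos_reach_sure_safe(q, safe)` stands for the procedure of Section 4.2, and `toy_oracle` is a deterministic-graph stand-in used only to make the sketch runnable. Taking the union of all subsets that pass the check is an assumption of this sketch; it is sound here because safety with respect to a larger set is easier to satisfy.

```python
from itertools import combinations


def winning_region(states, pos_reach_sure_safe):
    """Union of all subsets Q' in which every state passes the oracle check."""
    region = set()
    for size in range(1, len(states) + 1):
        for subset in combinations(states, size):
            candidate = frozenset(subset)
            # every q in Q' must reach T positively while staying in Q'
            if all(pos_reach_sure_safe(q, candidate) for q in candidate):
                region |= candidate
    return region


def toy_oracle(edges: dict, target: set):
    """Stand-in oracle on a plain graph: is `target` reachable inside `safe`?"""
    def check(q, safe):
        seen, stack = set(), [q]
        while stack:
            s = stack.pop()
            if s in target:
                return True
            if s in seen:
                continue
            seen.add(s)
            stack.extend(t for t in edges.get(s, ()) if t in safe)
        return False
    return check


# Usage on a four-state toy graph: 0 -> 1 -> 3 (target), 2 -> 2 (trap).
region = winning_region([0, 1, 2, 3], toy_oracle({0: [1], 1: [3], 2: [2]}, {3}))
```

The loop performs $2^{|Q|}$ oracle rounds, which is where the exponential factor in the running time comes from.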

Theorem 4. In one-sided partial-observation stochastic games with player 1
perfect and player 2 partial, non-elementary size memory is sufficient for
pure strategies to ensure almost-sure reachability for player 1; and hence for
pure almost-sure winning strategies for reachability objectives for player 1 the
non-elementary memory bound is optimal.

Corollary 1. In one-sided partial-observation stochastic games with player 1
perfect and player 2 partial, the problem of deciding the existence of pure
almost-sure and positive winning strategies for reachability objectives for player 1 can
be solved in non-elementary time complexity.

5 Finite-memory Strategies for Two-sided Games 

In this section we show the existence of finite-memory pure strategies for positive
and almost-sure winning in two-sided games.

5.1 Positive reachability with almost-sure safety 

Let $T$ be the set of target states for reachability (such that all the target states
are absorbing) and $Q_G$ be the set of good states for safety with $T \subseteq Q_G$. Our
goal is to show that for pure strategies to ensure positive probability reachability
to $T$ and almost-sure safety for $Q_G$, finite-memory strategies suffice. Note that
with $Q_G$ as the whole state space we obtain the result for positive reachability
as a special case.

Lemma 2. For all games $G$, for all $q \in Q$, if there exists a pure strategy
$\sigma \in \Sigma^O \cap \Sigma^P$ such that for all strategies $\pi \in \Pi^O$ of player 2 we have

$$\Pr_q^{\sigma,\pi}(\mathrm{Reach}(T)) > 0 \text{ and } \Pr_q^{\sigma,\pi}(\mathrm{Safe}(Q_G)) = 1;$$

then there exists a finite-memory pure strategy $\sigma^f \in \Sigma^O \cap \Sigma^P$ such that for all
strategies $\pi \in \Pi^O$ of player 2 we have

$$\Pr_q^{\sigma^f,\pi}(\mathrm{Reach}(T)) > 0 \text{ and } \Pr_q^{\sigma^f,\pi}(\mathrm{Safe}(Q_G)) = 1.$$

We prove the result with the following two claims. We fix a (possibly infinite
memory) strategy $\sigma \in \Sigma^O \cap \Sigma^P$ such that for all strategies $\pi \in \Pi^O$ of player 2
we have

$$\Pr_q^{\sigma,\pi}(\mathrm{Reach}(T)) > 0 \text{ and } \Pr_q^{\sigma,\pi}(\mathrm{Safe}(Q_G)) = 1.$$

Claim 1. If there exists $N \in \mathbb{N}$ such that for all strategies $\pi \in \Pi^O$ of player 2
we have

$$\Pr_q^{\sigma,\pi}(\mathrm{Reach}^{\le N}(T)) > 0 \text{ and } \Pr_q^{\sigma,\pi}(\mathrm{Safe}(Q_G)) = 1$$

where $\mathrm{Reach}^{\le N}$ denotes reachability within the first $N$ steps; then there exists a
finite-memory pure strategy $\sigma^f \in \Sigma^O \cap \Sigma^P$ such that for all strategies $\pi \in \Pi^O$
of player 2 we have

$$\Pr_q^{\sigma^f,\pi}(\mathrm{Reach}(T)) > 0 \text{ and } \Pr_q^{\sigma^f,\pi}(\mathrm{Safe}(Q_G)) = 1.$$

Proof. The finite-memory strategy $\sigma^f$ is as follows: play like the strategy $\sigma$
for the first $N$ steps, and then switch to a strategy to ensure $\mathrm{Safe}(Q_G)$ with
probability 1. The strategy ensures positive probability reachability to $T$ as for
the first $N$ steps it plays like $\sigma$ and $\sigma$ already ensures positive reachability within
$N$ steps. Moreover, since $\sigma$ ensures $\mathrm{Safe}(Q_G)$ with probability 1, it must also
ensure $\mathrm{Safe}(Q_G)$ for the first $N$ steps, and since $\sigma^f$ after the first $N$ steps only
plays a strategy for almost-sure safety, it follows that $\sigma^f$ guarantees $\mathrm{Safe}(Q_G)$
with probability 1. The strategy $\sigma^f$ is a finite-memory strategy since it needs
to play like $\sigma$ for the first $N$ steps (which requires finite memory) and then it
switches to an almost-sure safety strategy for which exponential size memory is
sufficient (for safety objectives almost-sure winning coincides with sure winning
and then belief-based strategies are sufficient; see [13] for details). □

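Claim 2. There exists $N \in \mathbb{N}$ such that for all strategies $\pi \in \Pi^O$ of player 2 we
have

$$\Pr_q^{\sigma,\pi}(\mathrm{Reach}^{\le N}(T)) > 0 \text{ and } \Pr_q^{\sigma,\pi}(\mathrm{Safe}(Q_G)) = 1$$

where $\mathrm{Reach}^{\le N}$ denotes reachability within the first $N$ steps.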

Proof. The proof is by contradiction. Towards contradiction, assume that for all
$n \in \mathbb{N}$, there exists a strategy $\pi_n \in \Pi^O$ such that either
$\Pr_q^{\sigma,\pi_n}(\mathrm{Reach}^{\le n}(T)) = 0$ or $\Pr_q^{\sigma,\pi_n}(\mathrm{Safe}(Q_G)) < 1$.

If for some $n \ge 0$ we have $\Pr_q^{\sigma,\pi_n}(\mathrm{Safe}(Q_G)) < 1$, then we get a
contradiction with the fact that $\Pr_q^{\sigma,\pi}(\mathrm{Safe}(Q_G)) = 1$ for all $\pi \in \Pi^O$. Hence
$\Pr_q^{\sigma,\pi_n}(\mathrm{Safe}(Q_G)) = 1$ for all $n \in \mathbb{N}$, and therefore $\Pr_q^{\sigma,\pi_n}(\mathrm{Reach}^{\le n}(T)) = 0$
for all $n \in \mathbb{N}$. Equivalently, all play prefixes of length at most $n$ that are
compatible with $\sigma$ and $\pi_n$ avoid $T$, and thus $\Pr_q^{\sigma,\pi_n}(\mathrm{Safe}^{\le n}(Q \setminus T)) = 1$ for all
$n \in \mathbb{N}$. Note that we can assume that each strategy $\pi_n$ is pure because once the
strategy $\sigma$ of player 1 is fixed we get a POMDP for player 2, and for POMDPs
pure strategies are as powerful as randomized strategies [14] (in [14] the result
was shown for finite POMDPs with finite action set, but the proof is based on
induction on the action set and also works for countably infinite POMDPs).

Using a simple extension of König's Lemma [29], we construct a strategy
$\pi' \in \Pi^O$ such that $\Pr_q^{\sigma,\pi'}(\mathrm{Safe}(Q \setminus T)) = 1$. The construction is as follows. In
the initial state $q$, there is an action $b_0 \in A_2$ which is played by infinitely many
strategies $\pi_n$. We define $\pi'(q) = b_0$ and let $P_0$ be the set $\{\pi_n \mid \pi_n(q) = b_0\}$.
Note that $P_0$ is an infinite set. We complete the construction as follows. Having
defined $\pi'(\rho)$ for all play prefixes $\rho$ of length at most $k$, and given the infinite
set $P_k$, we define $\pi'(\rho')$ for all play prefixes $\rho'$ of length $k + 1$ and the infinite set
$P_{k+1}$ as follows. Consider the tuple $b^k_n \in A_2^m$ of actions played by the strategy
$\pi_n \in P_k$ after the $m$ prefixes $\rho'$ of length $k + 1$. Clearly, there exists an infinite
subset $P_{k+1}$ of $P_k$ in which all strategies play the same tuple $b_{k+1}$. We define
$\pi'(\rho')$ using the tuple $b_{k+1}$. This construction ensures that no play prefix of
length $k + 1$ compatible with $\sigma$ and $\pi'$ hits the set $T$, since $\pi'$ agrees with some
strategy $\pi_n$ for arbitrarily large $n$. Repeating this inductive argument yields a
strategy $\pi'$ such that $\Pr_q^{\sigma,\pi'}(\mathrm{Safe}(Q \setminus T)) = 1$, in contradiction with the fact that
$\Pr_q^{\sigma,\pi}(\mathrm{Reach}(T)) > 0$ for all $\pi \in \Pi^O$. Hence, the desired result follows. □

The above two claims establish Lemma 2 and give the following result.

Theorem 5. In two-sided partial-observation stochastic games finite memory is
sufficient for pure strategies to ensure positive probability reachability along with
almost-sure safety for player 1; and hence for pure positive winning strategies for
reachability objectives finite memory is sufficient and non-elementary memory
is required in general for player 1.

5.2 Almost-sure reachability 

We now show that for pure strategies for almost-sure reachability, finite-memory
strategies suffice. The proof is a straightforward extension of the results of
Section 4.3, and for finite-memory strategies for positive reachability with
almost-sure safety we use the result of the previous subsection.

Notation. We consider $T$ as the set of target states and without loss of
generality assume that all target states are absorbing. In this section the belief
of player 2 represents the set of states in which the game can be with positive
probability. Given strategies $\sigma$ and $\pi$ for player 1 and player 2, respectively, a
state $q$ and a set $K \subseteq Q$ we denote by $\Pr_{q,K}^{\sigma,\pi}(\cdot)$ the probability distribution
when the players play the strategies, the initial state is $q$ and the initial belief
for player 2 is $K$.

In the rest of this section we omit the subscript $G$ (e.g., we write $\Pi^O$ instead of
$\Pi_G^O$) as the game is clear from the context.

Bad beliefs. Let $\overline{T} = Q \setminus T$. Let

$$Q_B = \{ B \in 2^Q \mid \forall \sigma \in \Sigma^O \cap \Sigma^P \cdot \exists \pi \in \Pi^O \cdot \exists q \in B : \Pr_{q,\{q\}}^{\sigma,\pi}(\mathrm{Safe}(\overline{T})) > 0 \}$$

be the set of beliefs $B$ such that for all pure strategies for player 1 there is
a counter strategy for player 2 with a state $q \in B$ to ensure that given the
initial belief of player 2 is the singleton $\{q\}$, $\mathrm{Safe}(\overline{T})$ is satisfied with positive
probability. We will consider $Q_B$ as the set of bad beliefs.

Property of an almost-sure winning strategy. Consider a pure almost-sure
winning strategy for player 1 that ensures against all strategies of player 2 that $T$ is
reached with probability 1. Then we claim that the belief of player 2 must never
intersect with $Q_B$: otherwise, if the belief intersects with $Q_B$, let $B$ be the belief
in $Q_B$ that is reached with positive probability. Then there exists $q \in B$ such
that player 2 can simply assume that the current state is $q$, update the belief
to $\{q\}$, and the guess is correct with positive probability; and then against all
player 1 pure strategies there is a strategy for player 2 to
ensure with positive probability that $T$ is not reached. This contradicts that the
strategy for player 1 is almost-sure winning. Let $Q_G = 2^Q \setminus Q_B$. By definition
we have

$$Q_G = \{ B \in 2^Q \mid \exists \sigma \in \Sigma^O \cap \Sigma^P \cdot \forall \pi \in \Pi^O \cdot \forall q \in B : \Pr_{q,\{q\}}^{\sigma,\pi}(\mathrm{Reach}(T)) = 1 \}.$$

By the argument above that for a pure almost-sure winning strategy the belief
must never intersect with $Q_B$, we have

$$Q_G = \{ B \in 2^Q \mid \exists \sigma \in \Sigma^O \cap \Sigma^P \cdot \forall \pi \in \Pi^O \cdot \forall q \in B : \Pr_{q,\{q\}}^{\sigma,\pi}(\mathrm{Reach}(T)) = 1 \text{ and } \Pr_{q,\{q\}}^{\sigma,\pi}(\mathrm{Safe}(2^Q \setminus Q_B)) = 1 \}.$$

Let

$$Q_G^p = \{ B \in 2^Q \mid \exists \sigma \in \Sigma^O \cap \Sigma^P \cdot \forall \pi \in \Pi^O \cdot \forall q \in B : \Pr_{q,\{q\}}^{\sigma,\pi}(\mathrm{Reach}(T)) > 0 \text{ and } \Pr_{q,\{q\}}^{\sigma,\pi}(\mathrm{Safe}(2^Q \setminus Q_B)) = 1 \}.$$

We now show that $Q_G^p = Q_G$. The inclusion $Q_G \subseteq Q_G^p$ is trivial, and we now
show the other inclusion $Q_G^p \subseteq Q_G$. Observe that in $Q_G^p$ we have the property of
positive reachability and almost-sure safety and we will use strategies for positive
reachability and almost-sure safety to construct a witness finite-memory
almost-sure winning strategy. Note that here we have safety for a set of beliefs (instead of
a set of states), and it is straightforward to verify that the argument of the previous
subsection holds when the safe set is a set of beliefs. We consider $Q_B$ as the set
of unsafe beliefs (i.e., $Q_G$ is the safe set), and $T$ as the target and invoke the
results of the previous subsection: for all $B \in Q_G^p$ there is a pure finite-memory
strategy $\sigma_B$ to ensure that from all states $q \in B$, within $N$ steps (for some finite
$N \in \mathbb{N}$), $T$ is reached with probability at least some positive constant $\eta_B > 0$,
even when the initial belief for player 2 is $\{q\}$. Let $\eta = \min_{B \in Q_G^p} \eta_B$. A pure
finite-memory almost-sure winning strategy is described below. The strategy
plays in two phases: (1) the Restart phase; and (2) the Play phase. We define
them as follows:

1. Restart phase. Let the current belief be $B$, assume that the belief for player 2
   is any perfect belief $\{q\}$, for $q \in B$, and go to the Play phase with strategy
   $\sigma_B$ that ensures that $Q_G$ is never left and $T$ is reached within $N$ steps with
   probability at least $\eta > 0$.

2. Play phase. Let $\sigma$ be the strategy defined in the Restart phase, then play $\sigma$
   for $N$ steps and go back to the Restart phase.

The strategy is almost-sure winning as for all states in $Q_G^p$ and for all histories,
in every $N$ steps the probability to reach $T$ is at least $\eta > 0$, and $Q_G$ (and hence
$Q_G^p$) is never left. Thus the probability to reach $T$ in $N \cdot \ell$ steps, for $\ell \in \mathbb{N}$, is at
least $1 - (1 - \eta)^{\ell}$ and this tends to 1 as $\ell \to \infty$. Thus the desired result follows and we
obtain the required finite-memory almost-sure winning strategy.

Memory bound and algorithm. The memory upper bound for the almost-sure
winning strategy constructed is $|2^Q| \cdot B + \log N$: we require $|2^Q|$
strategies of the previous subsection of memory size $B$ and a counter to count
up to $N$ steps, where $B$ is the memory required for strategies to ensure positive
reachability with almost-sure safety objectives.

Theorem 6. In two-sided partial-observation stochastic games finite memory is
sufficient (and non-elementary memory is required in general) for pure strategies
for almost-sure winning for reachability objectives for player 1.

6 Equivalence of Randomized Action-invisible Strategies 
and Pure Strategies 

In this section, we show that for two-sided partial-observation games, the
problem of almost-sure winning with randomized action-invisible strategies is
inter-reducible with the problem of almost-sure winning with pure strategies. The
reductions are polynomial in the number of states in the game (the reduction
from randomized to pure strategies is exponential in the number of actions).

It follows from the reduction of pure to randomized action-invisible
strategies that the memory lower bounds for pure strategies transfer to randomized
strategies, and in particular belief-based memoryless strategies are not sufficient,
showing that a remark (without proof) of [16, p. 4] and the result and
construction of [26, Theorem 1] are wrong.

6.1 Reduction of randomized action-invisible strategies to pure 
strategies 

We give a reduction for almost-sure winning for randomized action-invisible
strategies to pure strategies. Given a stochastic game $G$ we will construct another
stochastic game $H$ such that there is a randomized action-invisible almost-sure
winning strategy in $G$ iff there is a pure almost-sure winning strategy in $H$. We
first show in Lemma 3 the correctness of the reduction for finite-memory
randomized action-invisible strategies, and then show in Lemma 4 that finite memory is
sufficient in two-sided partial-observation games for randomized action-invisible
strategies.

Construction. Given a stochastic game $G = (Q, q_0, \delta)$ over action sets $A_1$
and $A_2$, and observations $O_1$ and $O_2$ (along with the corresponding observation
mappings $\mathrm{obs}_1$ and $\mathrm{obs}_2$), we construct a game $H = (Q, q_0, \delta_H)$ over action sets
$2^{A_1} \setminus \{\emptyset\}$ and $A_2$ and observations $O_1$ and $O_2$. The transition function $\delta_H$ is
defined as follows:

- for all $q \in Q$ and $A \in 2^{A_1} \setminus \{\emptyset\}$ and $b \in A_2$ we have
  $\delta_H(q, A, b)(q') = \frac{1}{|A|} \sum_{a \in A} \delta(q, a, b)(q')$, i.e., in a state in $Q$ player 1 selects
  a non-empty subset $A \subseteq A_1$ of actions and the transition function $\delta_H$ simulates
  the transition function $\delta$ along with the uniform distribution over the set $A$
  of actions.

The observation mappings $\mathrm{obs}_i^H$ in $H$, for $i \in \{1, 2\}$, are as follows:
$\mathrm{obs}_i^H(q) = \mathrm{obs}_i(q)$, where $\mathrm{obs}_i$ is the observation mapping in $G$.
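The lifted transition function can be sketched directly from its definition. In this sketch, which is an illustration and not the authors' code, `delta` is a hypothetical callable returning a dictionary from successor states to probabilities, and exact rational arithmetic is used so that the averaging is free of rounding.

```python
from fractions import Fraction
from itertools import combinations


def nonempty_subsets(actions):
    """All non-empty subsets of A_1, i.e. the player-1 actions of H."""
    return [
        frozenset(c)
        for size in range(1, len(actions) + 1)
        for c in combinations(actions, size)
    ]


def delta_h(delta, q, subset, b):
    """delta_H(q, A, b): average of delta(q, a, b) over a in the subset A."""
    weight = Fraction(1, len(subset))
    dist = {}
    for a in subset:
        for q_next, p in delta(q, a, b).items():
            dist[q_next] = dist.get(q_next, 0) + weight * p
    return dist
```

The action set of player 1 in the new game has $2^{|A_1|} - 1$ elements, which is the source of the exponential dependence on the number of actions noted above.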

Lemma 3. The following assertions hold for reachability objectives:

1. If there is a pure almost-sure winning strategy in $H$, then there is a
   randomized action-invisible almost-sure winning strategy in $G$.

2. If there is a finite-memory randomized action-invisible almost-sure winning
   strategy in $G$, then there is a pure almost-sure winning strategy in $H$.

Proof. We present both parts of the proof below. 

1. Let $\sigma_H$ be a pure almost-sure winning strategy in $H$. We construct a
randomized action-invisible almost-sure winning strategy $\sigma_G$ in $G$. The strategy $\sigma_G$
is constructed as follows. Let $\rho_G = q_0 q_1 \ldots q_k$ be a play prefix in $G$;
consider the same play prefix $\rho_H = q_0 q_1 \ldots q_k$ in $H$, and let $A_k = \sigma_H(\rho_H)$.
The strategy $\sigma_G(\rho_G)$ plays all actions in $A_k$ uniformly at random. Since $\sigma_H$
is an almost-sure winning strategy, it follows that $\sigma_G$ is also almost-sure winning.
Also observe that if $\sigma_H$ is observation-based, then so is $\sigma_G$.
2. Let $\sigma_G$ be a finite-memory randomized action-invisible almost-sure winning
strategy in $G$. If the strategy $\sigma_G$ is fixed in $G$ we obtain a finite POMDP,
and by the results of [15] it follows that in a POMDP the precise transition
probabilities do not affect almost-sure winning. Hence if $\sigma_G$ is almost-sure
winning, then the uniform version $\sigma_G^u$ of the strategy $\sigma_G$, which always plays
the same support of the probability distribution as $\sigma_G$ but plays all actions
in the support uniformly at random, is also almost-sure winning. Given $\sigma_G^u$ we
construct a pure almost-sure winning strategy $\sigma_H$ in $H$. Given a play prefix
$\rho_H = q_0 q_1 \ldots q_k$ in $H$, consider the same play prefix $\rho_G = q_0 q_1 \ldots q_k$ in $G$.
Let $A_k = \mathsf{Supp}(\sigma_G(\rho_G))$; then $\sigma_H(\rho_H)$ plays the action $A_k \in (2^{A_1} \setminus \{\emptyset\})$.
Since $\sigma_G^u$ is almost-sure winning it follows that $\sigma_H$ is almost-sure winning.
Observe that if $\sigma_G$ is observation-based, then so is $\sigma_G^u$, and then so is $\sigma_H$.

The desired result follows. □ 
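
The two strategy translations used in this proof can be written down directly. The sketch below is an illustration under stated assumptions, not the paper's formalism: a strategy is modelled as a Python function from a play prefix (a tuple of states) to either a set of actions (pure strategy in $H$) or a dictionary of action probabilities (randomized strategy in $G$), and the names `uniform_from_pure`, `pure_from_support` and `example_sigma_g` are hypothetical.

```python
from fractions import Fraction


def uniform_from_pure(sigma_h):
    """Part 1: a pure strategy of H (prefix -> set of actions) becomes a
    randomized strategy of G that plays that set uniformly at random."""
    def sigma_g(prefix):
        chosen = sigma_h(prefix)
        return {a: Fraction(1, len(chosen)) for a in chosen}
    return sigma_g


def pure_from_support(sigma_g):
    """Part 2: a randomized strategy of G becomes a pure strategy of H
    that plays the support of the distribution as a single action."""
    def sigma_h(prefix):
        return frozenset(a for a, p in sigma_g(prefix).items() if p > 0)
    return sigma_h


# Hypothetical randomized strategy: mixes on odd-length prefixes only.
def example_sigma_g(prefix):
    if len(prefix) % 2 == 1:
        return {"a": Fraction(1, 3), "b": Fraction(2, 3)}
    return {"a": Fraction(1), "b": Fraction(0)}


example_sigma_h = pure_from_support(example_sigma_g)
example_uniform = uniform_from_pure(example_sigma_h)
```

Composing the two maps yields the uniform version of the original strategy: the support is preserved while the exact probabilities are discarded, which by the result of [15] cited in the proof does not affect almost-sure winning.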

Lemma 4. For reachability objectives, if there exists a randomized action- 
invisible almost-sure winning strategy in G, then there exists also a finite-memory 
randomized action-invisible almost-sure winning strategy in G. 

Proof. Let $W = \{ B \in 2^Q \mid B$ is a belief of player 1 such that $\exists \sigma \in \Sigma^O \cdot \forall \pi \in
\Pi^O \cdot \forall q \in B : \Pr_q^{\sigma,\pi}(\mathsf{Reach}(\mathcal{T})) = 1 \}$ denote the set of belief sets $B$ for player 1
such that player 1 has a (possibly infinite-memory) randomized action-invisible
almost-sure winning strategy from all starting states in $B$. It follows that the
almost-sure winning strategy must ensure that the set $W$ is never left: this is
because from the complement of $W$, against every randomized action-invisible
strategy for player 1 there is a counter-strategy for player 2 to ensure that with
positive probability the target is not reached. Moreover for all $B \in W$ the almost-sure
winning strategy also ensures that $\mathcal{T}$ is reached with positive probability. Hence
we have again the problem of positive reachability with almost-sure safety. We
simply repeat the proof for the pure strategy case, treating sets of actions (that
is, the supports of the randomized strategy) as actions (for the pure strategy)
played uniformly at random (as in the reduction from $G$ to $H$), and thus obtain
a witness finite-memory strategy $\sigma_G$ to ensure positive reachability and
almost-sure safety. Repeating the strategy $\sigma_G$ with play phase and repeat phase (as
in the case of pure strategies) we obtain the desired finite-memory almost-sure
winning strategy. □

The following theorem follows from the previous two lemmas. 

Theorem 7. Given a two-sided (resp. one-sided) partial-observation stochastic 
game G with a reachability objective we can construct in time polynomial in 
the size of the game and exponential in the size of the action sets a two-sided 
(resp. one-sided) partial-observation stochastic game H such that there exists a 
randomized action-invisible almost-sure winning strategy in G iff there exists a 
pure almost-sure winning strategy in H. 

For positive winning, randomized memoryless strategies are sufficient (both
for action-visible and action-invisible) and the problem is PTIME-complete for
one-sided and EXPTIME-complete for two-sided [7]. The above theorem along
with Theorem 1 gives us the following corollary for almost-sure winning for
randomized action-invisible strategies.

Corollary 2. Given one-sided partial-observation stochastic games with 
player 1 partial and player 2 perfect, the following assertions hold for reach- 
ability objectives for player 1: 

1. (Memory complexity). Exponential memory is sufficient for randomized 
action-invisible strategies for almost-sure winning. 

2. (Algorithm). The existence of a randomized action-invisible almost-sure win- 
ning strategy can be decided in time exponential in the state space of the game 
and exponential in the size of the action sets. 

3. (Complexity). The problem of deciding the existence of a randomized
action-invisible almost-sure winning strategy is EXPTIME-complete.

6.2 Reduction of pure strategies to randomized action-invisible 
strategies 

We present a reduction for almost-sure winning from pure strategies to
randomized action-invisible strategies. Given a stochastic game $G$ we construct another
stochastic game $H$ such that there exists a pure almost-sure winning strategy in
$G$ iff there exists a randomized action-invisible almost-sure winning strategy in $H$.

The idea of the reduction is to force player 1 to play a pure strategy in $H$.
The game $H$ simulates $G$ and requires player 1 to repeat each action played
(i.e., to play each action two times). Then, if player 1 uses randomization, he has
to repeat the actions chosen randomly in the previous step. Since the actions are
invisible, this can be achieved only if the support of the randomized actions is a
singleton, i.e., the strategy is pure. Note that the reduction works for randomized
strategies with actions invisible, and not when the actions are visible.

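Construction. Given a stochastic game $G = \langle Q, q_0, \delta_G \rangle$ over action sets $A_1$
and $A_2$, and observations $\mathcal{O}_1$ and $\mathcal{O}_2$ (along with the corresponding observation
mappings $\mathsf{obs}_1$ and $\mathsf{obs}_2$), we construct a game $H = \langle Q \cup (Q \times A_1) \cup \{\mathsf{sink}\}, q_0, \delta_H \rangle$
over the same action sets $A_1$ and $A_2$ and observations $\mathcal{O}_1$ and $\mathcal{O}_2$. The transition
function $\delta_H$ is defined as follows: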

- for all $q \in Q$, $a \in A_1$ and $b \in A_2$ we have $\delta_H(q, a, b)((q, a)) = 1$, i.e., in
  a state $q$ for action $a$ of player 1, irrespective of the choice of player 2, the
  game stores player 1's action with probability 1;

- for all $(q, a) \in Q \times A_1$ and all $b \in A_2$ we have $\delta_H((q, a), a, b) = \delta_G(q, a, b)$, i.e., if
  player 1 repeats the action played in the previous step, then the probabilistic
  transition function is the same as in $G$; and for all $a' \in A_1 \setminus \{a\}$, we have
  $\delta_H((q, a), a', b)(\mathsf{sink}) = 1$, i.e., if player 1 does not repeat the same action, then
  the sink state is reached;

- for all $a \in A_1$ and $b \in A_2$, we have $\delta_H(\mathsf{sink}, a, b)(\mathsf{sink}) = 1$.

The observation mappings $\mathsf{obs}_i^H$ in $H$ ($i \in \{1, 2\}$) are as follows: $\mathsf{obs}_i^H(q) =
\mathsf{obs}_i^H((q, a)) = \mathsf{obs}_i(q)$, where $\mathsf{obs}_i$ is the observation mapping in $G$. Note that
$H$ is of size polynomial in the size of $G$.
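
The three transition rules above translate into a short routine. As before, this is a sketch under assumptions rather than the authors' implementation: transitions are encoded as dictionaries from `(state, a, b)` to `{successor: probability}`, the name `repeat_game_delta` and the toy game are hypothetical, and the label `"sink"` is assumed not to clash with a state of $G$.

```python
from fractions import Fraction

SINK = "sink"  # assumed to be a fresh state name, not used in G


def repeat_game_delta(delta_g, states, actions1, actions2):
    """Build delta_H for the game that forces player 1 to repeat actions."""
    delta_h = {}
    for q in states:
        for a in actions1:
            for b in actions2:
                # Rule 1: from q, store player 1's action a with probability 1.
                delta_h[(q, a, b)] = {(q, a): Fraction(1)}
                for a_next in actions1:
                    if a_next == a:
                        # Rule 2a: the action is repeated, so move as in G.
                        delta_h[((q, a), a_next, b)] = dict(delta_g[(q, a, b)])
                    else:
                        # Rule 2b: a different action leads to the sink.
                        delta_h[((q, a), a_next, b)] = {SINK: Fraction(1)}
    for a in actions1:
        for b in actions2:
            # Rule 3: the sink is absorbing.
            delta_h[(SINK, a, b)] = {SINK: Fraction(1)}
    return delta_h


# Hypothetical toy game: Q = {q0, q1}, A_1 = {a, b}, A_2 = {c}.
toy_delta_g = {
    ("q0", "a", "c"): {"q0": Fraction(1)},
    ("q0", "b", "c"): {"q1": Fraction(1)},
    ("q1", "a", "c"): {"q1": Fraction(1)},
    ("q1", "b", "c"): {"q1": Fraction(1)},
}
toy_h = repeat_game_delta(toy_delta_g, ["q0", "q1"], ["a", "b"], ["c"])
```

The state space of $H$ has $|Q| + |Q| \cdot |A_1| + 1$ states, in line with the remark that $H$ is of size polynomial in the size of $G$.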

Lemma 5. Let $\mathcal{T} \subseteq Q$ be a set of target states. There exists a pure
almost-sure winning strategy in $G$ for $\mathsf{Reach}(\mathcal{T})$ if and only if there exists a randomized
action-invisible almost-sure winning strategy in $H$ for objective $\mathsf{Reach}(\mathcal{T})$.

Proof. We present both directions of the proof below. 

1. Let $\sigma_H$ be a randomized action-invisible almost-sure winning strategy in
$H$. We show that we can assume wlog that $\sigma_H$ is actually a pure strategy.
To see this, assume that under strategy $\sigma_H$ there is a prefix
$\rho_H = q_0 (q_0, a_0) q_1 (q_1, a_1) \ldots q_k$ in $H$ compatible with $\sigma_H$ from which $\sigma_H$ plays a
randomized action with support $A \subseteq A_1$ and $|A| > 1$. Then, with positive
probability the states $(q_k, a_k)$ and $(q_k, a'_k)$ are reached where $a_k, a'_k \in A$ and
$a_k \neq a'_k$. Since these two states have the same observation for player 1 and the
actions are invisible, $\sigma_H$ plays the same distribution from both. No matter the
action(s) played by $\sigma_H$ in the next step, the state
sink is reached with positive probability in the next step, either from $(q_k, a_k)$
or from $(q_k, a'_k)$. This contradicts that $\sigma_H$ is almost-sure winning. Therefore,
we can assume that $\sigma_H$ is a pure strategy that repeats each action two times.
We construct a pure almost-sure winning strategy in $G$ by removing these
repetitions.

2. Let $\sigma_G$ be a pure almost-sure winning strategy in $G$. Consider the strategy
$\sigma_H$ in $H$ that always repeats two times the actions played by $\sigma_G$. The
strategy $\sigma_H$ is observation-based and almost-sure winning since $H$ simulates $G$
when actions are repeated twice.

The desired result follows. □ 
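
The key step of the first direction can be checked numerically. In the sketch below, `first` is the distribution played in $q_k$ and `second` is the distribution played in the repeat step; because actions are invisible, `second` cannot depend on the action actually sampled from `first`. The function names and the dictionary encoding of distributions are assumptions made for this illustration.

```python
from fractions import Fraction


def sink_probability(first, second):
    """Probability that the action sampled from `second` differs from the
    action sampled (independently) from `first`, i.e. that sink is reached."""
    match = sum(p * second.get(a, Fraction(0)) for a, p in first.items())
    return 1 - match


def min_sink_probability(first):
    """Smallest sink probability over all choices of `second`.

    The matching probability is linear in `second`, so it is maximized by
    deterministically repeating a most likely action of `first`."""
    return 1 - max(first.values())


mixed = {"a": Fraction(1, 2), "b": Fraction(1, 2)}
pure = {"a": Fraction(1)}
```

For any `first` with more than one action in its support, `max(first.values())` is strictly below 1, so the sink is reached with positive probability whatever player 1 plays next; only a singleton support brings this probability down to 0.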



Theorem 8. Given a two-sided partial-observation stochastic game G with a 
reachability objective we can construct in time polynomial in the size of the game 
and size of the action sets a two-sided partial-observation stochastic game H 
such that there exists a pure almost-sure winning strategy in G iff there exists a 
randomized action-invisible almost-sure winning strategy in $H$.

Belief-based strategies are not sufficient. We illustrate our reduction with
the following example that shows belief-based (belief-only) randomized
action-invisible strategies are not sufficient for almost-sure reachability in one-sided
partial-observation games (player 1 partial and player 2 perfect), showing that
a remark (without proof) of [16, p. 4] and the result and construction of [26,
Theorem 1] are wrong.

Example 5. We illustrate the reduction on the example of Fig. 1. The result of
the reduction is given in Fig. 10. Remember that Example 1 showed that
belief-based pure strategies are not sufficient for almost-sure winning. We show that
belief-based randomized strategies are not sufficient for almost-sure winning in
the game of Fig. 10. First, in $\{q_1, q_2\}$ player 1 has to play pure since he has to
be able to repeat the same action to avoid reaching the sink state with positive
probability. Now, the argument is the same as in Example 1: playing always the
same action (either $a$ or $b$) in $\{q_1, q_2\}$ is not even positive winning as player 2
can choose the state in this set (either $q_2$ or $q_1$). ■

Fig. 10. Belief-based strategies are not sufficient. The game graph obtained by
the reduction of pure to randomized strategies on the game of Fig. 1 (for almost-sure
reachability objective). Player 1 is blind and player 2 has perfect observation. There
exists an almost-sure winning randomized strategy (with invisible actions), but there
is no belief-based memoryless almost-sure winning randomized strategy.

Note that our reduction preserves the structure and memory of almost-sure 
winning strategies, hence the non-elementary lower bound given in Theorem 2 
for pure strategies also transfers to randomized action-invisible strategies by the 
same reduction. 

Corollary 3. For one-sided partial-observation stochastic games, with player 1
partial and player 2 perfect, belief-based randomized action-invisible strategies are
not sufficient for almost-sure winning for reachability objectives. For two-sided
partial-observation stochastic games, memory of non-elementary size is
necessary in general for almost-sure winning for randomized action-invisible strategies
for reachability objectives.